Re: Are there roadblocks to creating custom DocRouter implementations?

2017-05-16 Thread Dorian Hoxha
Also interested in custom/pluggable routing.
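
Something like the sketch below is roughly what I imagine Shawn's idea would look like (plain Java, hypothetical names; in today's Solr the routerMap is private, so this is not the real DocRouter API):

    import java.util.HashMap;
    import java.util.Map;

    // Names below are hypothetical, not the current Solr API (where routerMap is private).
    abstract class Router {
        private static final Map<String, Router> routerMap = new HashMap<>();

        protected static void register(String name, Router router) {
            routerMap.put(name, router);        // what each implementation's static {} block would call
        }

        static Router getRouter(String name) {
            return routerMap.get(name);         // what the CREATE action would look up
        }

        abstract String shardFor(String docId);
    }

    class GeoRouter extends Router {
        static {
            register("geo", new GeoRouter());   // self-registration on class load
        }

        @Override
        String shardFor(String docId) {
            return "shard1";                    // pick a shard from geolocation data instead
        }
    }

One caveat: a static block only runs once the class is actually loaded, so something (e.g. a Class.forName on the configured router name) would still have to trigger loading of the custom class.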

On Tue, May 16, 2017 at 4:47 PM, Erick Erickson 
wrote:

> Hmmm, would the functionality be served by just using implicit routing and
> putting the logic for populating the route field in whatever creates the doc?
> Not, perhaps, as elegant as having some kind of pluggable routing, I
> grant.
>
> Best,
> Erick
>
> On Tue, May 16, 2017 at 7:31 AM, Shawn Heisey  wrote:
> > There was a question in the #solr IRC channel about creating a custom
> > document router to assign documents to shards based on geolocation data.
> >
> > Looking into this, I think I see a roadblock or two standing in the way
> > of users creating custom router implementations.
> >
> > The "routerMap" field in the DocRouter class is private, and its
> > contents are not dynamically created.  It appears that only specific
> > names (null, plain, implicit, compositeId) are added to the map.
> >
> > I'm thinking that if we make routerMap protected (or create protected
> > access methods), and put "static { }" code blocks in each implementation
> > that add themselves to the parent routerMap, it will be much easier for
> > a user to create their own implementation and have it automatically
> > available to use in a CREATE action.
> >
> > Is this worth an issue in Jira?
> >
> > Thanks,
> > Shawn
> >
>


Re: knowing which fields were successfully hit

2017-05-16 Thread Dorian Hoxha
Something like elasticsearch named-queries, right
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-named-queries-and-filters.html
?
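
A rough SolrJ sketch of the highlighting approach John describes (core and field names here are made up): request highlights on both candidate fields and check which one produced a snippet for each hit.

    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class WhichFieldHit {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();

            SolrQuery q = new SolrQuery("acme widget");
            q.setHighlight(true);
            q.addHighlightField("purchased_item");
            q.addHighlightField("competitive_alternative");

            QueryResponse rsp = client.query(q);
            // highlighting results are keyed by unique id, then by field name;
            // a field only shows up here if it produced a matching snippet
            Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
            for (SolrDocument doc : rsp.getResults()) {
                String id = (String) doc.getFieldValue("id");
                Map<String, List<String>> fields = hl.get(id);
                boolean hitPurchased = fields != null && fields.containsKey("purchased_item");
                System.out.println(id + " matched on "
                    + (hitPurchased ? "purchased_item" : "competitive_alternative"));
            }
            client.close();
        }
    }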


On Tue, May 16, 2017 at 7:10 PM, John Blythe  wrote:

> sorry for the confusion. as in i received results due to matches on field x
> vs. field y.
>
> i've gone w a highlighting solution for now. the fact that it requires
> field storage isn't yet prohibitive for me, so can serve well for now. open
> to any alternative approaches all the same
>
> thanks-
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Tue, May 16, 2017 at 11:37 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
>
> > what do you mean "hit?" As in the user clicked it?
> >
> > On Tue, May 16, 2017 at 11:35 AM, John Blythe 
> wrote:
> >
> > > hey all. i'm sending data out that could represent a purchased item or
> a
> > > competitive alternative. when the results are returned i'm needing to
> > know
> > > which of the two were hit so i can serve up the *other*.
> > >
> > > i can make a blunt instrument in the application layer to simply look
> > for a
> > > match between the queried terms and the resulting fields, but the
> problem
> > > of fuzzy matching and some of the special analysis being done to get
> the
> > > hits will be for naught.
> > >
> > > cursory googling landed me at a similar discussion that suggested using
> > hit
> > > highlighting or retrieving the debuggers explain data to sort through.
> > >
> > > is there another, more efficient means or are these the two tools in
> the
> > > toolbox?
> > >
> > > thanks!
> > >
> >
>


Re: TrieIntField vs IntPointField performance only for equality comparison (no range filtering)

2017-05-16 Thread Dorian Hoxha
Hi Shawn,

I forgot that legacy-int-fields were deprecated. Point fields it is then.
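
(For reference, the two declarations I was comparing look roughly like this; the type names are just examples:)

    <fieldType name="tint" class="solr.TrieIntField"  precisionStep="0" docValues="true"/>
    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>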

Thanks,
Dorian

On Tue, May 16, 2017 at 3:01 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/16/2017 3:33 AM, Dorian Hoxha wrote:
> > Has anyone measured which is more efficient/performant between the 2
> > intfields if we don't need to do range-checking ? (precisionStep=0)
>
> Point field support in Solr is *BRAND NEW*.  Very little information is
> available yet on the Solr implementation.  Benchmarks were done at the
> Lucene level, but I do not know what the numbers were.  If any Solr
> benchmarks were done, which I can't be sure about, I do not know where
> the results might be.
>
> Lucene had Points support long before Solr did.  The Lucene developers
> felt so strongly about the superiority of the Point implementations that
> they completely deprecated the legacy numeric field classes (which is
> what Trie classes use) early in the 6.x development cycle, slating them
> for removal in 7.0.
>
> If you wonder about backward compatibility in Solr 7.0 because the
> Lucene legacy numerics are disappearing, then you've discovered a
> dilemma that we're facing before the 7.0 release.
>
> Thanks,
> Shawn
>
>


TrieIntField vs IntPointField performance only for equality comparison (no range filtering)

2017-05-16 Thread Dorian Hoxha
Hi,

Has anyone measured which is more efficient/performant between the 2
intfields if we don't need to do range-checking ? (precisionStep=0)

Regards,
Dorian


Re: Atomic Updates

2017-04-26 Thread Dorian Hoxha
@Chris,
According to the doc link above, only 'inc' and 'set' are in-place updates, and only
when the field is not indexed or stored (docValues only), while your 'integer_field' is. So there are still
shenanigans somewhere (docs, your code, your test, or Solr code).
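
For reference, an in-place update only applies to a field declared roughly like the one below (field and collection names made up), and the request itself is just a normal atomic update:

    <field name="likes" type="int" indexed="false" stored="false" docValues="true"/>

    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycollection/update' \
      -d '[{"id":"doc1","likes":{"inc":1}}]'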

On Thu, Apr 27, 2017 at 2:04 AM, Chris Ulicny <culicny@iq.media> wrote:

> That's probably it then. None of the atomic updates that I've tried have
> been on TextFields. I'll give the TextField atomic update a try to verify that it
> will clear the other field.
>
> Has this functionality been consistent since atomic updates were
> introduced, or is this a side effect of some other change? It'd be very
> convenient for us to use this functionality as it currently works, but if
> it's something that prevents us from upgrading versions in the future, we
> should probably avoid expecting it to work.
>
> On Wed, Apr 26, 2017 at 7:36 PM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
> > > Hmm, interesting. I can imagine that as long as you're updating
> > > docValues fields, the other_text field would be there. But the instant
> > > you updated a non-docValues field (text_field in your example) the
> > > other_text field would disappear
> >
> > I can confirm this. When in-place updates to DV fields are done, the rest
> > of the fields remain as they were.
> >
> > On Thu, Apr 27, 2017 at 4:33 AM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> > > Hmm, interesting. I can imagine that as long as you're updating
> > > docValues fields, the other_text field would be there. But the instant
> > > you updated a non-docValues field (text_field in your example) the
> > > other_text field would disappear.
> > >
> > > I DO NOT KNOW this for a fact, but I'm asking people who do.
> > >
> > > On Wed, Apr 26, 2017 at 2:13 PM, Dorian Hoxha <dorian.ho...@gmail.com>
> > > wrote:
> > > > There are In Place Updates, but according to docs they still shouldn't
> > > work
> > > > in your case:
> > > > https://cwiki.apache.org/confluence/display/solr/
> > > Updating+Parts+of+Documents
> > > >
> > > > On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny <culicny@iq.media>
> > wrote:
> > > >
> > > >> That's the thing I'm curious about though. As I mentioned in the
> first
> > > >> post, I've already tried a few tests, and the value seems to still
> be
> > > >> present after an atomic update.
> > > >>
> > > >> I haven't exhausted all possible atomic updates, but 'set' and 'add'
> > > seem
> > > >> to preserve the non-stored text field.
> > > >>
> > > >> Thanks,
> > > >> Chris
> > > >>
> > > >> On Wed, Apr 26, 2017 at 4:07 PM Dorian Hoxha <
> dorian.ho...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > You'll lose the data in that field. Try doing a commit and it
> should
> > > >> > happen.
> > > >> >
> > > >> > On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny <culicny@iq.media>
> > > wrote:
> > > >> >
> > > >> > > Thanks Shawn, I didn't realize docValues were enabled by default
> > > now.
> > > >> > > That's very convenient and probably makes a lot of the schemas
> > we've
> > > >> been
> > > >> > > making excessively verbose.
> > > >> > >
> > > >> > > This is on 6.3.0. Do you know what the first version was that
> they
> > > >> added
> > > >> > > the docValues by default for non-Text field?
> > > >> > >
> > > >> > > However, that shouldn't apply to this since I'm concerned with a
> > > >> > non-stored
> > > >> > > TextField without docValues enabled.
> > > >> > >
> > > >> > > Best,
> > > >> > > Chris
> > > >> > >
> > > >> > > On Wed, Apr 26, 2017 at 3:36 PM Shawn Heisey <
> apa...@elyograg.org
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > On 4/25/2017 1:40 PM, Chris Ulicny wrote:
> > > >> > > > > Hello all,
> > > >> > > > >
> > > >> > > > > Suppose I have the following fields in a document and
> populate
> > > all
> > > >> 4
> > > >> > > > fields
> > > >> > > > > for every document.
> > > >> > > > >
> > > >> > > > > id: uniqueKey, indexed and stored
> > > >> > > > > integer_field: indexed and stored
> > > >> > > > > text_field: indexed and stored
> > > >> > > > > othertext_field: indexed but not stored
> > > >> > > > >
> > > >> > > > > No default values, multivalues, docvalues, copyfields, or
> any
> > > other
> > > >> > > > > properties set.
> > > >> > > >
> > > >> > > > You didn't indicate the Solr version.  In recent Solr
> versions,
> > > most
> > > >> > > > field classes other than TextField have docValues enabled by
> > > default,
> > > >> > > > even if the config is not mentioned on the field, and in those
> > > >> > versions,
> > > >> > > > docValues will take the place of stored if stored is false.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > > Shawn
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>


Re: Atomic Updates

2017-04-26 Thread Dorian Hoxha
There are In Place Updates, but according to docs they still shouldn't work
in your case:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

On Wed, Apr 26, 2017 at 10:36 PM, Chris Ulicny <culicny@iq.media> wrote:

> That's the thing I'm curious about though. As I mentioned in the first
> post, I've already tried a few tests, and the value seems to still be
> present after an atomic update.
>
> I haven't exhausted all possible atomic updates, but 'set' and 'add' seem
> to preserve the non-stored text field.
>
> Thanks,
> Chris
>
> On Wed, Apr 26, 2017 at 4:07 PM Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
>
> > You'll lose the data in that field. Try doing a commit and it should
> > happen.
> >
> > On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny <culicny@iq.media> wrote:
> >
> > > Thanks Shawn, I didn't realize docValues were enabled by default now.
> > > That's very convenient and probably makes a lot of the schemas we've
> been
> > > making excessively verbose.
> > >
> > > This is on 6.3.0. Do you know what the first version was that they
> added
> > > the docValues by default for non-Text field?
> > >
> > > However, that shouldn't apply to this since I'm concerned with a
> > non-stored
> > > TextField without docValues enabled.
> > >
> > > Best,
> > > Chris
> > >
> > > On Wed, Apr 26, 2017 at 3:36 PM Shawn Heisey <apa...@elyograg.org>
> > wrote:
> > >
> > > > On 4/25/2017 1:40 PM, Chris Ulicny wrote:
> > > > > Hello all,
> > > > >
> > > > > Suppose I have the following fields in a document and populate all
> 4
> > > > fields
> > > > > for every document.
> > > > >
> > > > > id: uniqueKey, indexed and stored
> > > > > integer_field: indexed and stored
> > > > > text_field: indexed and stored
> > > > > othertext_field: indexed but not stored
> > > > >
> > > > > No default values, multivalues, docvalues, copyfields, or any other
> > > > > properties set.
> > > >
> > > > You didn't indicate the Solr version.  In recent Solr versions, most
> > > > field classes other than TextField have docValues enabled by default,
> > > > even if the config is not mentioned on the field, and in those
> > versions,
> > > > docValues will take the place of stored if stored is false.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > > >
> > >
> >
>


Re: Atomic Updates

2017-04-26 Thread Dorian Hoxha
You'll lose the data in that field. Try doing a commit and it should happen.

On Wed, Apr 26, 2017 at 9:50 PM, Chris Ulicny  wrote:

> Thanks Shawn, I didn't realize docValues were enabled by default now.
> That's very convenient and probably makes a lot of the schemas we've been
> making excessively verbose.
>
> This is on 6.3.0. Do you know what the first version was that they added
> the docValues by default for non-Text field?
>
> However, that shouldn't apply to this since I'm concerned with a non-stored
> TextField without docValues enabled.
>
> Best,
> Chris
>
> On Wed, Apr 26, 2017 at 3:36 PM Shawn Heisey  wrote:
>
> > On 4/25/2017 1:40 PM, Chris Ulicny wrote:
> > > Hello all,
> > >
> > > Suppose I have the following fields in a document and populate all 4
> > fields
> > > for every document.
> > >
> > > id: uniqueKey, indexed and stored
> > > integer_field: indexed and stored
> > > text_field: indexed and stored
> > > othertext_field: indexed but not stored
> > >
> > > No default values, multivalues, docvalues, copyfields, or any other
> > > properties set.
> >
> > You didn't indicate the Solr version.  In recent Solr versions, most
> > field classes other than TextField have docValues enabled by default,
> > even if the config is not mentioned on the field, and in those versions,
> > docValues will take the place of stored if stored is false.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Return all docs with same last-value when sorting by non-unique-value

2017-04-15 Thread Dorian Hoxha
Say order_by=likes descending, limit(4), and the likes values are:
10, 9, 8, 7, 7, 7, 4, 2.
Then we'd get back all documents from 10 down to 7, so 6 docs.
The same thing applies if the tied values sort into the middle.
It could also have a max limit, so we don't get too many docs returned.

Makes sense ?
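
In the meantime, the closest client-side workaround I can think of is to over-fetch the ties with a second query (a SolrJ sketch; 'client' is whatever SolrClient you already have, and 100 is just an example cap):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrDocumentList;

    SolrQuery q = new SolrQuery("*:*");
    q.setSort("likes", SolrQuery.ORDER.desc);
    q.setRows(4);
    SolrDocumentList page = client.query(q).getResults();

    // value of the last doc on the page (7 in the example above)
    Object lastValue = page.get(page.size() - 1).getFieldValue("likes");

    SolrQuery ties = new SolrQuery("*:*");
    ties.addFilterQuery("likes:" + lastValue);
    ties.setRows(100);                         // the "max-limit" safety valve
    SolrDocumentList tied = client.query(ties).getResults();
    // merge `tied` into `page`, skipping ids that are already there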



On Sat, Apr 15, 2017 at 8:24 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Not really making sense, no. Could you show an example? Also, you seem
> to imply that after sorting only the documents sorted at the end may
> have same values. What if they have the same values but sort into the
> middle?
>
> Regards,
>Alex
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 15 April 2017 at 08:06, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > Hi friends,
> >
> > Say we're sorting by a non-unique-value, and also have a limit(x). But
> > there are more docs in the end of list(x) that have the same value. Is it
> > possible to return them even if the number of items will be > x ?
> > This will make it possible so I don't have to sort by (non-unique,unique)
> > values.
> >
> > Makes sense ?
> >
> > Thank You,
> > Dorian
>


Return all docs with same last-value when sorting by non-unique-value

2017-04-14 Thread Dorian Hoxha
Hi friends,

Say we're sorting by a non-unique-value, and also have a limit(x). But
there are more docs in the end of list(x) that have the same value. Is it
possible to return them even if the number of items will be > x ?
This will make it possible so I don't have to sort by (non-unique,unique)
values.

Makes sense ?

Thank You,
Dorian


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Dorian Hoxha
@alessandro
Elastic-search has it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html
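
The closest thing I know of in Solr is the frange trick Alessandro mentions, roughly like this (the 0.5 cutoff is just an example value):

    q=<your query>
    fq={!frange l=0.5 cache=false}query($q)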

On Wed, Apr 12, 2017 at 1:49 PM, alessandro.benedetti 
wrote:

> I am not completely sure that the potential benefit of merging less docs in
> sharded pagination overcomes the additional time needed to apply the
> filtering function query.
> I would need to investigate more in details the frange internals.
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329489.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamic schema memory consumption

2017-04-11 Thread Dorian Hoxha
Here is a small snippet that I copy-pasted from Shawn Heisey (who is a core
contributor I think, he's good):

> One thing to note:  SolrCloud begins to have performance issues when the
> number of collections in the cloud reaches the low hundreds.  It's not
> going to scale very well with a collection per user or per mailbox
> unless there aren't very many users.  There are people looking into how
> to scale better, but this hasn't really gone anywhere yet.  Here's one
> issue about it, with a lot of very dense comments:
>
> https://issues.apache.org/jira/browse/SOLR-7191


On Tue, Apr 11, 2017 at 9:11 PM, Dorian Hoxha <dorian.ho...@gmail.com>
wrote:

> And this overhead depends on what? I mean, if I create an empty collection
>> will it take up much heap size  just for "being there" ?
>
> Yes. You can search on elastic-search/solr/lucene mailing lists and see
> that it's true. But nobody has `empty` collections, so yours will have a
> schema and some data/segments and translog.
>
> On Tue, Apr 11, 2017 at 7:41 PM, jpereira <jpereira...@gmail.com> wrote:
>
>> The way the data is spread across the cluster is not really uniform. Most
>> of
>> shards have way lower than 50GB; I would say about 15% of the total shards
>> have more than 50GB.
>>
>>
>> Dorian Hoxha wrote
>> > Each shard is a lucene index which has a lot of overhead.
>>
>> And this overhead depends on what? I mean, if I create an empty collection
>> will it take up much heap size  just for "being there" ?
>>
>>
>> Dorian Hoxha wrote
>> > I don't know about static/dynamic memory-issue though.
>>
>> I could not find anything related in the docs or the mailing list either,
>> but I'm still not ready to discard this suspicion...
>>
>> Again, thx for your time
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble
>> .com/Dynamic-schema-memory-consumption-tp4329184p4329367.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: Dynamic schema memory consumption

2017-04-11 Thread Dorian Hoxha
>
> And this overhead depends on what? I mean, if I create an empty collection
> will it take up much heap size  just for "being there" ?

Yes. You can search on elastic-search/solr/lucene mailing lists and see
that it's true. But nobody has `empty` collections, so yours will have a
schema and some data/segments and translog.

On Tue, Apr 11, 2017 at 7:41 PM, jpereira <jpereira...@gmail.com> wrote:

> The way the data is spread across the cluster is not really uniform. Most
> of
> shards have way lower than 50GB; I would say about 15% of the total shards
> have more than 50GB.
>
>
> Dorian Hoxha wrote
> > Each shard is a lucene index which has a lot of overhead.
>
> And this overhead depends on what? I mean, if I create an empty collection
> will it take up much heap size  just for "being there" ?
>
>
> Dorian Hoxha wrote
> > I don't know about static/dynamic memory-issue though.
>
> I could not find anything related in the docs or the mailing list either,
> but I'm still not ready to discard this suspicion...
>
> Again, thx for your time
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329367.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamic schema memory consumption

2017-04-11 Thread Dorian Hoxha
What I'm suggesting is that you should aim for at most ~50GB of data per
shard. How much is it currently ?
Each shard is a Lucene index, which has a lot of overhead. If you can, try
to have 20x-50x-100x fewer shards than you currently have and you'll see lower
heap requirements. I don't know about the static/dynamic-schema memory issue though.

On Tue, Apr 11, 2017 at 6:09 PM, jpereira <jpereira...@gmail.com> wrote:

> Dorian Hoxha wrote
> > Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a
> > little too much for 3TB of data ?
> > Something like 0.167GB for each shard ?
> > Isn't that too much overhead (i've mostly worked with es but still lucene
> > underneath) ?
>
> I don't have only 3TB , I have 3TB in two tier2 machines, the whole cluster
> is 12 TB :) So what I was trying to explain was this:
>
> NODES A & B
> 3TB per machine , 36 collections * 12 shards (432 indexes) , average heap
> footprint of 60GB
>
> NODES C & D - at first
> ~725GB per machine, 4 collections * 12 shards (48 indexes) , average heap
> footprint of 12GB
>
> NODES C & D - after addding 220GB schemaless data
> ~1TB per machine, 46 collections * 12 shards (552 indexes),  average heap
> footprint of 48GB
>
> So, what you are suggesting is that the culprit for the bump in heap
> footprint is the new collections?
>
>
> Dorian Hoxha wrote
> > Also you should change the heap 32GB->30GB so you're guaranteed to get
> > pointer compression. I think you should have no need to increase it more
> > than this, since most things have moved to out-of-heap stuff, like
> > docValues etc.
>
> I was forced to raise the heap size because the memory requirements
> dramatically raised, hence this post :)
>
> Thanks
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Dynamic-schema-memory-consumption-tp4329184p4329345.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Filtering results by minimum relevancy score

2017-04-11 Thread Dorian Hoxha
Can't the filter be used when you're paginating in a sharded scenario ?
So if you do limit=10, offset=10, each shard will return 20 docs ?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs ? (they will still score all docs, but merging will be
faster)

Makes sense ?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti  wrote:

> Can i ask what is the final requirement here ?
> What are you trying to do ?
>  - just display less results ?
> you can easily do at search client time, cutting after a certain amount
> - make search faster returning less results ?
> This is not going to work, as you need to score all of them as Erick
> explained.
>
> Function query ( as Mikhail specified) will run on a per document basis (
> if
> I am correct), so if your idea was to speed up the things, this is not
> going
> to work.
>
> It makes much more sense to refine your system to improve relevancy if your
> concern is to have more relevant docs.
> If your concern is just to not show that many pages, you can limit that
> client side.
>
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamic schema memory consumption

2017-04-11 Thread Dorian Hoxha
Also, you should change the heap from 32GB to 30GB so you're guaranteed to get
compressed object pointers. I think you should have no need to increase it more
than that, since most things have moved off-heap, like
docValues etc.

On Tue, Apr 11, 2017 at 12:07 PM, Dorian Hoxha <dorian.ho...@gmail.com>
wrote:

> Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a
> little too much for 3TB of data ?
> Something like 0.167GB for each shard ?
> Isn't that too much overhead (i've mostly worked with es but still lucene
> underneath) ?
>
> Can't you use 1/100 the current number of collections ?
>
>
> On Mon, Apr 10, 2017 at 5:22 PM, jpereira <jpereira...@gmail.com> wrote:
>
>> Hello guys,
>>
>> I manage a Solr cluster and I am experiencing some problems with dynamic
>> schemas.
>>
>> The cluster has 16 nodes and 1500 collections with 12 shards per
>> collection
>> and 2 replicas per shard. The nodes can be divided in 2 major tiers:
>>  - tier1 is composed of 12 machines with 4 physical cores (8 virtual),
>> 32GB
>> ram and 4TB ssd; these are used mostly for direct queries and data
>> exports;
>>  - tier2 is composed of 4 machines with 20 physical cores (40 virtual),
>> 128GB and 4TB ssd; these are mostly for aggregation queries (facets)
>>
>> The problem I am experiencing is that when using dynamic schemas, the Solr
>> heap size rises dramatically.
>>
>> I have two tier2 machines (lets call them A and B) running one Solr
>> instance
>> each with 96GB heap size, with 36 collections totaling 3TB of mainly
>> fixed-schema (55GB schemaless) data indexed in each machine, and the heap
>> consumption is on average 60GB (it peaks at around 80GB and drops to
>> around
>> 40GB after a GC run).
>>
>> On the other tier2 machines (C and D) I was running one Solr instance on
>> each machine with 32GB heap size and 4 fixed schema collections with about
>> 725GB of data indexed in each machine, which took up about 12GB of heap
>> size. Recently I added 46 collections to these machines with about 220Gb
>> of
>> data. In order to do this I was forced to raise the heap size to 64GB and
>> after indexing everything now the machines have an averaged consumption of
>> 48GB (!!!) (max ~55GB, after GC runs ~37GB)
>>
>> I also noticed that when indexed fixed schema data the CPU utilization is
>> also dramatically lower. I have around 100 workers indexing fixed schema
>> data with and CPU utilization rate of about 10%, while I have only one
>> worker for schemaless data with a CPU utilization cost of about 20%.
>>
>> So, I have a two big questions here:
>> 1. Is this dramatic rise in resources consumption when using dynamic
>> fields
>> "normal"?
>> 2. Is there a way to lower the memory requirements? If so, how?
>>
>> Thanks for your time!
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble
>> .com/Dynamic-schema-memory-consumption-tp4329184.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: Dynamic schema memory consumption

2017-04-11 Thread Dorian Hoxha
Isn't 18K lucene-indexes (1 for each shard, not counting the replicas) a
little too much for 3TB of data ?
Something like 0.167GB for each shard ?
Isn't that too much overhead (i've mostly worked with es but still lucene
underneath) ?

Can't you use 1/100 the current number of collections ?

On Mon, Apr 10, 2017 at 5:22 PM, jpereira  wrote:

> Hello guys,
>
> I manage a Solr cluster and I am experiencing some problems with dynamic
> schemas.
>
> The cluster has 16 nodes and 1500 collections with 12 shards per collection
> and 2 replicas per shard. The nodes can be divided in 2 major tiers:
>  - tier1 is composed of 12 machines with 4 physical cores (8 virtual), 32GB
> ram and 4TB ssd; these are used mostly for direct queries and data exports;
>  - tier2 is composed of 4 machines with 20 physical cores (40 virtual),
> 128GB and 4TB ssd; these are mostly for aggregation queries (facets)
>
> The problem I am experiencing is that when using dynamic schemas, the Solr
> heap size rises dramatically.
>
> I have two tier2 machines (lets call them A and B) running one Solr
> instance
> each with 96GB heap size, with 36 collections totaling 3TB of mainly
> fixed-schema (55GB schemaless) data indexed in each machine, and the heap
> consumption is on average 60GB (it peaks at around 80GB and drops to around
> 40GB after a GC run).
>
> On the other tier2 machines (C and D) I was running one Solr instance on
> each machine with 32GB heap size and 4 fixed schema collections with about
> 725GB of data indexed in each machine, which took up about 12GB of heap
> size. Recently I added 46 collections to these machines with about 220Gb of
> data. In order to do this I was forced to raise the heap size to 64GB and
> after indexing everything now the machines have an averaged consumption of
> 48GB (!!!) (max ~55GB, after GC runs ~37GB)
>
> I also noticed that when indexed fixed schema data the CPU utilization is
> also dramatically lower. I have around 100 workers indexing fixed schema
> data with and CPU utilization rate of about 10%, while I have only one
> worker for schemaless data with a CPU utilization cost of about 20%.
>
> So, I have a two big questions here:
> 1. Is this dramatic rise in resources consumption when using dynamic fields
> "normal"?
> 2. Is there a way to lower the memory requirements? If so, how?
>
> Thanks for your time!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble
> .com/Dynamic-schema-memory-consumption-tp4329184.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SortingMergePolicy in solr 6.4.2

2017-04-07 Thread Dorian Hoxha
Did you get any update on this ?

On Tue, Mar 14, 2017 at 11:56 AM, Sahil Agarwal 
wrote:

> The SortingMergePolicy does not seem to be applied.
>
> The csv file gets indexed without errors. But when I search for a term, the
> results returned are not sorted by Marks.
>
> Following is a toy project in Solr 6.4.2 on which I tried to use
> SortingMergePolicyFactory.
>
> Just showing the changes that I did in the core's config files. Please tell
> me if any other info is needed.
> I used the basic_configs when creating core:
> create_core -c corename -d basic_configs
>
>
> managed-schema
>
> 
> .
> .
> .
> (field declarations for id, Name, Subject, Marks and the default _version_/_text_
> fields; the XML tags were mangled by the list archive, only attributes such as
> indexed="true" / stored="true" survive)
>
>
> solrconfig.xml
>
> <indexConfig>
>   <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
>     <str name="sort">Marks desc</str>
>     <str name="wrapped.prefix">inner</str>
>     <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
>   </mergePolicyFactory>
> </indexConfig>
>
> 1.csv
>
> id,Name,Subject,Marks
> 1,Sahil Agarwal,Computers,1108
> 2,Ian Roberts,Maths,7077
> 3,Karan Vatsa,English,6092
> 4,Amit Williams,Maths,3924
> 5,Vani Agarwal,Computers,4263
> 6,Brenda Gupta,Computers,2309
> .
> .
> (30 rows)
>
> What can be the problem??
>


Str vs Int/Long uniqueKey field performance

2017-04-03 Thread Dorian Hoxha
Hey friends,

Is there any difference in index size/performance between having the
`uniqueKey` field be a long vs a string ?
Meaning, does it use a different data-structure ?
Cause I remember elasticsearch always uses a string (since it adds the
#type, which solr doesn't have).

Regards,
Dorian


Re: Performance degradation after upgrading from 6.2.1 to 6.4.1

2017-02-14 Thread Dorian Hoxha
Did you see the other thread ? It looked like a problem with logging.

On Tue, Feb 14, 2017 at 10:52 AM, Henrik Brautaset Aronsen <
henrik.aron...@gmail.com> wrote:

> We are seeing performance degradation on our SolrCloud instances after
> upgrading to 6.4.1.
>
>
> Here are a couple of graphs.  As you can see, 6.4.1 was introduced 2/10
> 1200:
>
>
> https://www.dropbox.com/s/qrc0wodain50azz/solr1.png?dl=0
>
> https://www.dropbox.com/s/sdk30imm8jlomz2/solr2.png?dl=0
>
>
> These are two very different usage scenarios:
>
>
> * Solr1 has constant updates and very volatile data (30 minutes TTL, 20
> shards with no replicas, across 8 servers).  Requests in the 99 percentile
> went from ~400ms to 1000-1500ms. (Hystrix cutoff at 1.5s)
>
>
> * Solr2 is a more traditional instance with long-lived data (updated once a
> day, 24 shards with 2 replicas, across 8 servers).  Requests in the 99
> percentile went from ~400ms to at least 1s. (Hystrix cutoff at 1s)
>
>
> I've been looking around, but cannot really find a reason for the
> performance degradation.  Does any of you have an idea?
>
>
> Cheers,
>
> Henrik
>


Re: solr-user-unsubscribe

2017-01-30 Thread Dorian Hoxha
Come on dude. Just look at instructions. Have a little respect.

On Mon, Jan 30, 2017 at 1:55 PM, Rowe, William - 1180 - MITLL <
william.r...@ll.mit.edu> wrote:

> solr-user-unsubscribe
>
>
>


Commit required after delete ?

2017-01-05 Thread Dorian Hoxha
Hello friends,

Based on what I've read, I think "commit" isn't needed to make deletes
active (like we do with index/update), right ?

Since it just marks an in-memory deleted-id bitmap, right ?

Thank You


Re: Cloud Behavior when using numShards=1

2016-12-27 Thread Dorian Hoxha
I think Solr tries to load balance by itself. Read this page:
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
(preferLocalShards!)
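
(e.g. something like the request below; the collection name is taken from your create command:)

    /solr/sf_fingerprints/select?q=*:*&preferLocalShards=true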

Also please write the query.

tip: fill "send" address after completing email

On Tue, Dec 27, 2016 at 4:31 PM, Dave Seltzer  wrote:

> [Forgive the repeat here, I accidentally clicked send too early]
>
> Hi Everyone,
>
> I have a Solr index which is quite small (400,000 documents totaling 157
> MB) with a query load which is quite large. I therefore want to spread the
> load across multiple Solr servers.
>
> To accomplish this I've created a Solr Cloud cluster with two collections.
> The collections are configured with only 1 shard, but with 3 replicas in
> order to make sure that each of the three Solr servers has all of the data
> and can therefore answer any query without having to request data from
> another server. I use the following command:
>
> solr create -c sf_fingerprints -shards 1 -n fingerprints -replicationFactor
> 3
>
> I use HAProxy to spread the load across the three servers by directing the
> query to the server with the fewest current connections.
>
> However, when I turn up the load during testing I'm seeing some stuff in
> the logs of SERVER1 which makes me question my understanding of Solr Cloud:
>
> SERVER1: HttpSolrCall null:org.apache.solr.common.SolrException: Error
> trying to proxy request for url: http://SERVER3:8983/solr/sf_
> fingerprints/select 
>
> I'm curious why SERVER1 would be proxying requests to SERVER3 in a
> situation where the sf_fingerprints index is completely present on the
> local system.
>
> Is this a situation where I should be using generic replication rather than
> Cloud?
>
> Many thanks!
>
> -Dave
>


Re: ttl on merge-time possible somehow ?

2016-12-20 Thread Dorian Hoxha
On Mon, Dec 19, 2016 at 7:03 PM, Chris Hostetter 
wrote:

>
> : So, the other way this can be made better in my opinion is (if the
> : optimization is not already there)
> : Is to make the 'delete-query' on ttl-documents operation on translog to
> not
> : be forced to fsync to disk (so still written to translog, but no fsync).
> : The another index/delete happens, it will also fsync the translog of the
> : previous 'delete ttl query'.
> : If the server crashes, meaning we lost those deletes because the translog
> : wasn't fsynced to disk, then a thread can run on startup to recheck
> : ttl-deletes.
> : This will make it so the delete-query comes "free" in disk-fsync on
> : translog.
> : Makes sense ?
>
> All updates in Solr operate on both the in memory IndexWriter and the
> (Solr specific) transaction log, and only when a "hard commit" happens is
> the IndexWriter closed (causing segment files to fsync) ... the TTL code
> only does a "soft commit" which should not do any fsyncs on the index.
>
I wasn't talking about "committing" segments, I'm talking about not fsyncing
the translog on the delete-by-query for TTL.

It functions like a normal db, meaning you append to the log most of the
time, and when doing a commit (checkpoint), you write the segment and 'cut'
the log (so we don't replay the old log on restart to reach the latest state). But
you have to fsync the log to be sure it's on disk.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-18 Thread Dorian Hoxha
On Sun, Dec 18, 2016 at 3:48 PM, GW  wrote:

> Yeah,
>
>
> I'll look at the proxy you suggested shortly.
>
> I've discovered that the idea of making a zookeeper aware app is pointless
> when scripting REST calls right after I installed libzookeeper.
>
> Zookeeper is there to provide the zookeeping for Solr: End of story. Me
> thinks
>
> I believe what really has to happen is: connect to the admin API to get
> status
>
> /solr/admin/collections?action=CLUSTERSTATUS
>
> I think it is more sensible to make a cluster aware app.
>
> <lst name="shards">
>   <lst name="shard1">
>     <str name="range">80000000-7fffffff</str>
>     <str name="state">active</str>
>     <lst name="replicas">
>       <lst name="core_node1">
>         <str name="core">FrogMerchants_shard1_replica1</str>
>         <str name="base_url">http://10.128.0.2:8983/solr</str>
>         <str name="node_name">10.128.0.2:8983_solr</str>
>         <str name="state">active</str>
>         <str name="leader">true</str>
>       </lst>
>     </lst>
>   </lst>
> </lst>
>
> I can get an array of nodes that have a state of active. So if I have 7
> nodes that are state = active, I will have those in an array. Then I can
> use the rand() function with an array count to select a node/url to post a json
> string. It would eliminate the need for a load balancer. I think.
>
If you send to a random node, there is a high chance (increasing with the number of
nodes/shards) that the node won't host the leader, so that node will have to
forward the request to the leader. What you can do is compute the hash of the 'id'
field locally. From the hash you get the shard (because each shard owns a
hash range), from the shard you find the leader, and from the cluster status you
find which node the leader is on, then send the request
directly to the leader and be certain that it won't be forwarded again
(fewer network hops).
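
If you do have SolrJ available somewhere, a rough sketch of that lookup looks like the fragment below (CloudSolrClient already does exactly this internally for updates, so this is only worth it if you have to stay on plain REST; the ZK hosts and doc id are made up):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    CloudSolrClient cloud = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")     // your ZK ensemble (assumed)
            .build();
    cloud.connect();

    ClusterState state = cloud.getZkStateReader().getClusterState();
    DocCollection coll = state.getCollection("FrogMerchants");
    // hash the id, find the slice whose hash range covers it, then its leader
    Slice slice = coll.getRouter().getTargetSlice("some-doc-id", null, null, null, coll);
    Replica leader = slice.getLeader();
    String leaderUrl = leader.getCoreUrl();               // POST the JSON straight here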


> //pseudo code
>
> $array_count = count($active_nodes);
>
> $url_target = rand(0, $array_count - 1);
>
> // create a function to pull the url, something like
>
>
> $url = get_solr_url($url_target);
>
> I have test sever on my bench. I'll spin up a 5 node cluster today, get my
> app cluster aware and then get into some Solr indexes with Vi and totally
> screw with some shards.
>
> If I am correct I will post again.
>
> Best,
>
> GW
>
> On 15 December 2016 at 12:34, Shawn Heisey  wrote:
>
> > On 12/14/2016 7:36 AM, GW wrote:
> > > I understand accessing solr directly. I'm doing REST calls to a single
> > > machine.
> > >
> > > If I have a cluster of five servers and say three Apache servers, I can
> > > round robin the REST calls to all five in the cluster?
> > >
> > > I guess I'm going to find out. :-)  If so I might be better off just
> > > running Apache on all my solr instances.
> >
> > If you're running SolrCloud (which uses zookeeper) then sending multiple
> > query requests to any node will load balance the requests across all
> > replicas for the collection.  This is an inherent feature of SolrCloud.
> > Indexing requests will be forwarded to the correct place.
> >
> > The node you're sending to is a potential single point of failure, which
> > you can eliminate by putting a load balancer in front of Solr that
> > connects to at least two of the nodes.  As I just mentioned, SolrCloud
> > will do further load balancing to all nodes which are capable of serving
> > the requests.
> >
> > I use haproxy for a load balancer in front of Solr.  I'm not running in
> > Cloud mode, but a load balancer would also work for Cloud, and is
> > required for high availability when your client only connects to one
> > server and isn't cloud aware.
> >
> > http://www.haproxy.org/
> >
> > Solr includes a cloud-aware Java client that talks to zookeeper and
> > always knows the state of the cloud.  This eliminates the requirement
> > for a load balancer, but using that client would require that you write
> > your website in Java.
> >
> > The PHP clients are third-party software, and as far as I know, are not
> > cloud-aware.
> >
> > https://wiki.apache.org/solr/IntegratingSolr#PHP
> >
> > Some advantages of using a Solr client over creating HTTP requests
> > yourself:  The code is easier to write, and to read.  You generally do
> > not need to worry about making sure that your requests are properly
> > escaped for URLs, XML, JSON, etc.  The response to the requests is
> > usually translated into data structures appropriate to the language --
> > your program probably doesn't need to know how to parse XML or JSON.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Soft commit and reading data just after the commit

2016-12-18 Thread Dorian Hoxha
There's a very high probability that you're using the wrong tool for the
job if you need 1ms softCommit time. Especially when you always need it (ex
there are apps where you need commit-after-insert very rarely).

So explain what you're using it for ?

On Sun, Dec 18, 2016 at 3:38 PM, Lasitha Wattaladeniya 
wrote:

> Hi Furkan,
>
> Thanks for the links. I had read the first one but not the second one. I
> did read it after you sent. So in my current solrconfig.xml settings below
> are the configurations,
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:1}</maxTime>
> </autoSoftCommit>
>
>
> <autoCommit>
>   <maxTime>15000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> The problem i'm facing is, just after adding the documents to solr using
> solrj, when I retrieve data from solr I am not getting the updated results.
> This happens from time to time. Most of the time I get the correct data but in
> some occasions I get wrong results. so as you suggest, what the best
> practice to use here ? , should I wait 1 mili second before calling for
> updated results ?
>
> Regards,
> Lasitha
>
> Lasitha Wattaladeniya
> Software Engineer
>
> Mobile : +6593896893
> Blog : techreadme.blogspot.com
>
> On Sun, Dec 18, 2016 at 8:46 PM, Furkan KAMACI 
> wrote:
>
> > Hi Lasitha,
> >
> > First of all, did you check these:
> >
> > https://cwiki.apache.org/confluence/display/solr/Near+
> Real+Time+Searching
> > https://lucidworks.com/blog/2013/08/23/understanding-
> > transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > after that, if you cannot adjust your configuration you can give more
> > information and we can find a solution.
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Sun, Dec 18, 2016 at 2:28 PM, Lasitha Wattaladeniya <
> watt...@gmail.com>
> > wrote:
> >
> >> Hi furkan,
> >>
> >> Thanks for your reply, it is generally a query heavy system. We are
> using
> >> realtime indexing for editing the available data
> >>
> >> Regards,
> >> Lasitha
> >>
> >> Lasitha Wattaladeniya
> >> Software Engineer
> >>
> >> Mobile : +6593896893 <+65%209389%206893>
> >> Blog : techreadme.blogspot.com
> >>
> >> On Sun, Dec 18, 2016 at 8:12 PM, Furkan KAMACI 
> >> wrote:
> >>
> >>> Hi Lasitha,
> >>>
> >>> What is your indexing / querying requirements. Do you have an index
> >>> heavy/light  - query heavy/light system?
> >>>
> >>> Kind Regards,
> >>> Furkan KAMACI
> >>>
> >>> On Sun, Dec 18, 2016 at 11:35 AM, Lasitha Wattaladeniya <
> >>> watt...@gmail.com>
> >>> wrote:
> >>>
> >>> > Hello devs,
> >>> >
> >>> > I'm here with another problem i'm facing. I'm trying to do a commit
> >>> (soft
> >>> > commit) through solrj and just after the commit, retrieve the data
> from
> >>> > solr (requirement is to get updated data list).
> >>> >
> >>> > I'm using soft commit instead of the hard commit, is previously I got
> >>> an
> >>> > error "Exceeded limit of maxWarmingSearchers=2, try again later"
> >>> because of
> >>> > too many commit requests. Now I have removed the explicit commit and
> >>> has
> >>> > let the solr to do the commit using autoSoftCommit *(1 mili second)*
> >>> and
> >>> > autoCommit *(30 seconds)* configurations. Now I'm not getting any
> >>> errors
> >>> > when i'm committing frequently.
> >>> >
> >>> > The problem i'm facing now is, I'm not getting the updated data when
> I
> >>> > fetch from solr just after the soft commit. So in this case what are
> >>> the
> >>> > best practices to use ? to wait 1 mili second before retrieving data
> >>> after
> >>> > soft commit ? I don't feel like waiting from client side is a good
> >>> option.
> >>> > Please give me some help from your expert knowledge
> >>> >
> >>> > Best regards,
> >>> > Lasitha Wattaladeniya
> >>> > Software Engineer
> >>> >
> >>> > Mobile : +6593896893
> >>> > Blog : techreadme.blogspot.com
> >>> >
> >>>
> >>
> >>
> >
>


Re: ttl on merge-time possible somehow ?

2016-12-17 Thread Dorian Hoxha
On Sat, Dec 17, 2016 at 12:04 AM, Chris Hostetter 
wrote:

>
> : > lucene, something has to "mark" the segements as deleted in order for
> them
> ...
> : Note, it doesn't mark the "segment", it marks the "document".
>
> correct, typo on my part -- sorry.
>
> : > The disatisfaction you expressed with this approach confuses me...
> : >
> : Really ?
> : If you have many expiring docs
>
> ...you didn't seem to finish that thought so i'm still not really sure
> what your're suggestion is in terms of why an alternative would be more
> efficient.
>
Sorry about that. The reason why (I think/thought) it won't be as
efficient is that in some cases, like mine, all docs will expire,
rather fast (30 minutes in my case), so there will be a large number of
"deletes", which I thought were expensive.

So, if RocksDB did it this way, it would have to keep an index on the
TTL timestamp and then issue 2 deletes (one for the index entry, one for the original row).
In Lucene, because the storage is different, this is ~just a
deleted_bitmap[x]=1, which, if you disable the translog fsync (only for the
TTL delete), should be really fast and non-blocking (my issue).

So, the other way this could be made better, in my opinion (if the
optimization is not already there),
is to make the 'delete-query' on TTL'd documents not force an fsync of the
translog to disk (so it is still written to the translog, just not fsynced).
When the next index/delete happens, it will also fsync the translog entries of the
previous 'delete ttl query'.
If the server crashes, meaning we lost those deletes because the translog
wasn't fsynced to disk, then a thread can run on startup to recheck the
ttl-deletes.
This would make the delete-query come "free" in terms of disk fsyncs on the
translog.
Makes sense ?


>
> : "For example, with the configuration below the
> : DocExpirationUpdateProcessorFactory will create a timer thread that
> wakes
> : up every 30 seconds. When the timer triggers, it will execute a
> : *deleteByQuery* command to *remove any documents* with a value in the
> : press_release_expiration_date field value that is in the past "
>
> that document is describing a *logical* deletion as i mentioned before --
> the documents are "removed" in the sense that they are flaged "not alive"
> won't be included in future searches, but the data still lives in the
> segements on disk until a future merge.  (That is end user documentation,
> focusing on the effects as percieved by clients -- the concept of "delete"
> from a low level storage implementation is a much more involved concept
> that affects any discussion of "deleting" documents in solr, not just TTL
> based deletes)
>
> : > 1) nothing would ensure that docs *ever* get removed during perioids
> when
> : > docs aren't being added (thus no new segments, thus no merging)
> : >
> : This can be done with a periodic/smart thread that wakes up every 'ttl'
> and
> : checks min-max (or histogram) of timestamps on segments. If there are a
> : lot, do merge (or just delete the whole dead segment). At least that's
> how
> : those systems do it.
>
> OK -- with lucene/solr today we have the ConcurrentMergeScheduler which
> will watch for segments that have many (logically deleted) documents
> flaged "not alive" and will proactively merge those segments when the
> number of docs is above some configured/default threshold -- but to
> automatically flag those documents as "deleted" you need something like
> what solr is doing today.
>
I knew it checks "should we be merging". This would just be another clause.

>
>
> Again: i really feel like the only disconnect here is terminology.
>
> You're describing a background thread that wakes up periodically, scans
> the docs in each segment to see if they have an expire field > $now, and
> based on the size of the set of matches merges some segments and expunges
> the docs that were in that set.  For segments that aren't merged, docs
> stay put and are excluded from queries only by filters specified at
> request time.
>
> What Solr/Lucene has are 2 background threads: one wakes up periodically,
> scans the docs in the index to see if the expire field > $now and if so
> flags them as being "not alive" so they don't match queries at request
> time. A second thread chegks each segment to see how many docs are marked
> "not alive" -- either by the previous thread or by some other form of
> (logical) deletion -- and merges some of those segments, expunging the
> docs that were marked "not alive".  For segments that aren't merged, the
> "not alive" docs are still in the segment, but the "not alive" flag
> automatically excludes them from queries.
>
Yes, I knew it functions that way.
The ~whole~ misunderstanding is that the delete is more efficient than I
thought. The whole reason the other storage engines do it "the other
way" is the efficiency of deletes on those engines.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:53 PM, Chris Hostetter 
wrote:

>
> : Yep, that's what came in my search. See how TTL work in hbase/cassandra/
> : rocksdb . There
> : isn't a "delete old docs"query, but old docs are deleted by the storage
> : when merging. Looks like this needs to be a lucene-module which can then
> be
> : configured by solr ?
> ...
> : Just like in hbase,cassandra,rocksdb, when you "select" a row/document
> that
> : has expired, it exists on the storage, but isn't returned by the db,
>
>
> What you're describing is exactly how segment merges work in Lucene, it's
> just a question of terminology.
>
> In Lucene, "deleting" a document is a *logical* operation, the data still
> lives in the (existing) segments but the affected docs are recorded in a
> list of deletions (and automatically excluded from future searchers that
> are opened against them) ... once the segments are merged then the deleted
> documents are "expunged" rather then being copied over to the new
> segments.
>
> Where this diverges from what you describe is that as things stand in
> lucene, something has to "mark" the segements as deleted in order for them
> to later be expunged -- in Solr right now is the code in question that
> does this via (internal) DBQ.
>
Note, it doesn't mark the "segment", it marks the "document".

>
> The disatisfaction you expressed with this approach confuses me...
>
Really ?
If you have many expiring docs

>
> >> I did some search for TTL on solr, and found only a way to do it with a
> >> delete-query. But that ~sucks, because you have to do a lot of inserts
> >> (and queries).
>
> ...nothing about this approach does any "inserts" (or queries -- unless
> you mean the DBQ itself?) so w/o more elaboration on what exactly you find
> problematic about this approach, it's hard to make any sense of your
> objection or request for an alternative.
>
"For example, with the configuration below the
DocExpirationUpdateProcessorFactory will create a timer thread that wakes
up every 30 seconds. When the timer triggers, it will execute a
*deleteByQuery* command to *remove any documents* with a value in the
press_release_expiration_date field value that is in the past "
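
(The configuration that passage refers to is roughly the following, going by the ref guide; the 30-second period and the field name are the guide's example values:)

    <updateRequestProcessorChain default="true">
      <processor class="solr.DocExpirationUpdateProcessorFactory">
        <int name="autoDeletePeriodSeconds">30</int>
        <str name="expirationFieldName">press_release_expiration_date</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>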


>
> With all those caveats out of the way...
>
> What you're ultimately requesting -- new code that hooks into segment
> merging to exclude "expired" documents from being copied into the the new
> merged segments --- should be theoretically possible with a custom
> MergePolicy, but I don't really see how it would be better then the
> current approach in typically use cases (ie: i want docs excluded from
> results after the expiration date is reached, with a min tollerance of
> X) ...
>
I mentioned that the client would also make a range-query since expired
documents in this case would still be indexed.

>
> 1) nothing would ensure that docs *ever* get removed during perioids when
> docs aren't being added (thus no new segments, thus no merging)
>
This can be done with a periodic/smart thread that wakes up every 'ttl' and
checks min-max (or histogram) of timestamps on segments. If there are a
lot, do merge (or just delete the whole dead segment). At least that's how
those systems do it.

>
> 2) as you described, query clients would be required to specify date range
> filters on every query to identify the "logically live docs at this
> moment" on a per-request basis -- something that's far less efficient from
> a cachng standpoint then letting the system do a DBQ on the backened to
> affect the *global* set of logically live docs at the index level.
>
This makes sense. Deleted doc-ids are cached better than the range query
I described.

>
>
> Frankly: It seems to me that you've looked at how other non-lucene based
> systems X & Y handle TTL type logic and decided that's the best possible
> solution therefore the solution used by Solr "sucks" w/o taking into
> account that what's efficient in the underlying Lucene storage
> implementation might just be diff then what's efficient in the underlying
> storage implementation of X & Y.
>
Yes.

>
> If you'd like to tackle implementing TTL as a lower level primitive
> concept in Lucene, then by all means be my guest -- but personally i
> don't think you're going to find any real perf improvements in an
> approach like you describe compared to what we offer today.  i look
> forward to being proved wrong.
>
Since the implementation is apparently more efficient than I thought I'm
gonna leave it.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
Well there is a reason why they all do it that way.

I'm gonna guess that the reason lucene does it this way is because it keeps
a 'deleted docs bitset', which should act like a filter, which is not as
slow as doing a full-delete/insert like in the other dbs that I mentioned.

Thanks Shawn.

On Fri, Dec 16, 2016 at 9:57 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/16/2016 1:12 PM, Dorian Hoxha wrote:
> > Shawn, I know how it works, I read the blog post. But I don't want it
> > that
> > way. So how to do it my way? Like a custom merge function on lucene or
> > something else ?
>
> A considerable amount of custom coding.
>
> At a minimum, you'd have to write your own implementations of some
> Lucene classes and probably some Solr classes.  This sort of integration
> might also require changes to the upstream Lucene/Solr source code.  I
> doubt there would be enough benefit (either performance or anything
> else) to be worth the time and energy required.  If Lucene-level support
> would have produced a demonstrably better expiration feature, it would
> have been implemented that way.
>
> If you're *already* an expert in Lucene/Solr code, then it might be a
> fun intellectual exercise, but such a large-scale overhaul of an
> existing feature that works well is not something I would try to do.
>
> Thanks,
> Shawn
>
>


Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 8:11 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/16/2016 11:13 AM, Dorian Hoxha wrote:
> > Yep, that's what came in my search. See how TTL work in hbase/cassandra/
> > rocksdb <https://github.com/facebook/rocksdb/wiki/Time-to-Live>. There
> > isn't a "delete old docs"query, but old docs are deleted by the
> > storage when merging. Looks like this needs to be a lucene-module
> > which can then be configured by solr ?
>
> No.  Lucene doesn't know about expiration and doesn't need to know about
> expiration.
>
It needs to know, or else it will be inefficient in my case.

>
> The document expiration happens in Solr.  In the background, Solr
> finds/deletes old documents in the Lucene index according to how the
> expiration feature is configured.  What happens after that is basic
> Lucene operation.  If you index enough new data to trigger a merge (or
> if you do an optimize/forceMerge), then Lucene will get rid of deleted
> documents in the merged segments.  The contents of the documents in your
> index (whether that's a timestamp or something else) are completely
> irrelevant for decisions made during Lucene's segment merging.
>
Shawn, I know how it works, I read the blog post. But I don't want it that
way.
So how to do it my way? Like a custom merge function on lucene or something
else ?

>
> > Just like in hbase,cassandra,rocksdb, when you "select" a row/document
> > that has expired, it exists on the storage, but isn't returned by the
> > db, because it checks the timestamp and sees that it's expired. Looks
> > like this also need to be in lucene?
>
> That's pretty much how Lucene (and by extension, Solr) works, except
> it's not related to expiration, it is *deleted* documents that don't
> show up in the results.
>
No it doesn't. But I want expirations to function that way. Just like you
have "custom update processors", there should be a similar way for get (so
on my custom-get-processor, I check the timestamp and return NotFound if
it's expired)

>
> Thanks,
> Shawn
>
Makes sense ?


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 5:55 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You said that the data expires, but you haven't said
> how many docs you need to host at a time.

The data will expire in ~30 minutes on average. Many of them are updates to
the same document (this makes it worse, because updates are delete+insert).

> At 10M/second
> inserts you'll need a boatload of shards. All of the
> conversation about one beefy machine .vs. lots of not-so-beefy
> machines should wait until you answer that question.

Note, there will be multiple beefy machines (just fewer than with small
machines). I was asking what would be a big enough one, so that 1 instance will
be able to use the full server.

> For
> instance, indexing some _very_ simple documents on my
> laptop can hit 10,000 docs/second/shard. So you're talking
> 1,000 shards here. Indexing more complex docs I might
> get 1,000 docs/second/shard, so then you'd need 10,000
> shards. Don't take these as hard numbers, I'm
> just trying to emphasize that you'll need to do scaling
> exercises to see if what you want to do is reasonable given
> your constraints.
>
Of course. I think I've done ~80K/s/server on a previous project (it wasn't
the bottleneck, so I didn't bother too much), but there are too many knobs that
will change that number.

>
> If those 10M docs/second are bursty and you can stand some
> latency, then that's one set of considerations. If it's steady-state
> it's another. In either case you need some _serious_ design work
> before you go forward.
>
I expect 10M/s to be the burst rate, but the system needs to handle that burst.
And I don't think I will send 10M individual requests; I'll use small batches.

>
> And then you want to facet (fq clauses aren't nearly as expensive)
> and want 2 second commit intervals.
>
It is what it is.

>
> You _really_ need to stand up some test systems and see what
> performance you can get before launching off on trying to do the
> whole thing. Fortunately, you can stand up, say, a 4 shard system
> and tune it and drive it as fast as you possibly can and extrapolate
> from there.
>
My ~thinking~ would be to have 1 instance/server + 1 shard/core + 1 thread
for each shard. Assuming I remove all "blocking" disk operations (I don't
know if the network is async?), it should be the ~best~ scenario. I'll have to see
how it behaves in practice, though.

>
> But to reiterate. This is a very high indexing rate that very few
> organizations have attempted. You _really_ need to do a
> proof-of-concept _then_ plan.
>
That's why I posted: to ask what people have used as a big machine, and I would
test on that.

>
> Here's the long form of this argument:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-th
> e-abstract-why-we-dont-have-a-definitive-answer/


>
> Best,
> Erick
>
Thanks!

>
> On Fri, Dec 16, 2016 at 5:19 AM, GW <thegeofo...@gmail.com> wrote:
> > Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be
> spun
> > on up any host with a static IP. This has nothing to do with Solr which
> is
> > running on plain old hardware.
> >
> > Solrcloud is on a real cluster not on a SAN.
> >
> > The bit about dead with no error. I got this from a post I made asking
> > about the best way to deploy apps. Was shown some code on making your app
> > zookeeper aware. I am just getting to this so I'm talking from my ass. A
> ZK
> > aware program will have a list of nodes ready for business verses a plain
> > old Round Robin. If data on a machine is corrupted you can get 0 docs
> found
> > while a ZK aware app will know that node is shite.
> >
> >
> >
> >
> >
> >
> >
> > On 16 December 2016 at 07:20, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
> >
> >> On Fri, Dec 16, 2016 at 12:39 PM, GW <thegeofo...@gmail.com> wrote:
> >>
> >> > Dorian,
> >> >
> >> > From my reading, my belief is that you just need some beefy machines
> for
> >> > your zookeeper ensemble so they can think fast.
> >>
> >> Zookeeper need to think fast enough for cluster state/changes. So I
> think
> >> it scales with the number of machines/collections/shards and not
> documents.
> >>
> >> > After that your issues are
> >> > complicated by drive I/O which I believe is solved by using shards. If
> >> you
> >> > have a collection running on top of a single drive array it should not
> >> > compare to writing to a dozen drive arrays. So a whole bunch of light
> >> duty
> >> > machines that have a decent amount of memory and barely able process
> >> faster
> >> > than the

Re: ttl on merge-time possible somehow ?

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 4:42 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/16/2016 12:54 AM, Dorian Hoxha wrote:
> > I did some search for TTL on solr, and found only a way to do it with
> > a delete-query. But that ~sucks, because you have to do a lot of
> > inserts (and queries).
>
> You're going to have to be very specific about what you want Solr to do.
>
> > The other(kinda better) way to do it, is to set a collection-level
> > ttl, and when indexes are merged, they will drop the documents that
> > have expired in the new merged segment. On the client, I will make
> > sure to do date-range queries so I don't get back old documents. So:
> > 1. is there a way to easily modify the segment-merger (or better way?)
> > to do that ?
>
> Does the following describe the the feature you're after?
>
> https://lucidworks.com/blog/2014/05/07/document-expiration/
>
> If this is what you're after, this is *Solr* functionality.  Segment
> merging is *Lucene* functionality.  Lucene cannot remove documents
> during merge until they have been deleted.  It is Solr that handles
> deleting documents after they expire.  Lucene is not aware of the
> expiration concept.
>
Yep, that's what came up in my search. See how TTL works in hbase/cassandra/
rocksdb <https://github.com/facebook/rocksdb/wiki/Time-to-Live>. There
isn't a "delete old docs" query, but old docs are deleted by the storage
when merging. Looks like this needs to be a Lucene module which can then be
configured by Solr?


> > 2. is there a way to support this also on get ? looks like I can use
> > realtimeget + filter query and it should work based on documentation
>
> Realtime get allows you to retrieve documents that have been indexed but
> not yet committed.  I doubt that deleted documents or document
> expiration affects RTG at all.  We would need to know exactly what you
> want to get working here before we can say whether or not you're right
> when you say "it should work."
>
Just like in hbase/cassandra/rocksdb, when you "select" a row/document that
has expired, it exists in the storage but isn't returned by the DB,
because it checks the timestamp and sees that it's expired. Looks like this
also needs to be in Lucene?

>
> Thanks,
> Shawn
>
> Makes more sense ?


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
Makes more sense, but I think the master should do the write before it can
be redirected to other replicas. So not sure if that can be done.

In elasticsearch you can have datanodes and coordinator nodes:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#coordinating-node
I don't think it's available in solr though.

On Fri, Dec 16, 2016 at 1:43 PM, Jaroslaw Rozanski <m...@jarekrozanski.com>
wrote:

> Sorry, not what I meant.
>
> Leader is responsible for distributing update requests to replica. So
> eventually all replicas have same state as leader. Not a problem.
>
> It is more about the performance of such. If I gather correctly normal
> replication happens by standard update request. Not by, say, segment copy.
>
> Which means update on leader is as "expensive" as on replica.
>
> Hence, if my understanding is correct, sending search request to replica
> only, in index heavy environment, would bring no benefit.
>
> So the question is: is there a mechanism, in SolrCloud (not legacy
> master/slave set-up) to make one node take a load of indexing which
> other nodes focus on searching.
>
> This is not a question of SolrClient cause that is clear how to direct
> search request to specific nodes. This is more about index optimization
> so that certain nodes (ie. replicas) could suffer less due to high
> volume indexing while serving search requests.
>
>
>
>
> On 16/12/16 12:35, Dorian Hoxha wrote:
> > The leader is the source of truth. You expect to make the replica the
> > source of truth or something???Doesn't make sense?
> > What people do, is send write to leader/master and reads to
> replicas/slaves
> > in other solr/other-dbs.
> >
> > On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski <m...@jarekrozanski.com
> >
> > wrote:
> >
> >> Hi all,
> >>
> >> According to documentation, in normal operation (not recovery) in Solr
> >> Cloud configuration the leader sends updates it receives to all the
> >> replicas.
> >>
> >> This means and all nodes in the shard perform same effort to index
> >> single document. Correct?
> >>
> >> Is there then a benefit to *not* to send search requests to leader, but
> >> only to replicas?
> >>
> >> Given index & search heavy Solr Cloud system, is it possible to separate
> >> search from indexing nodes?
> >>
> >>
> >> RE: Solr 5.5.0
> >>
> >> --
> >> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> >> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
> >>
> >>
> >
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>


Re: Separating Search and Indexing in SolrCloud

2016-12-16 Thread Dorian Hoxha
The leader is the source of truth. You expect to make the replica the
source of truth or something???Doesn't make sense?
What people do, is send write to leader/master and reads to replicas/slaves
in other solr/other-dbs.

On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski 
wrote:

> Hi all,
>
> According to documentation, in normal operation (not recovery) in Solr
> Cloud configuration the leader sends updates it receives to all the
> replicas.
>
> This means and all nodes in the shard perform same effort to index
> single document. Correct?
>
> Is there then a benefit to *not* to send search requests to leader, but
> only to replicas?
>
> Given index & search heavy Solr Cloud system, is it possible to separate
> search from indexing nodes?
>
>
> RE: Solr 5.5.0
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 12:39 PM, GW <thegeofo...@gmail.com> wrote:

> Dorian,
>
> From my reading, my belief is that you just need some beefy machines for
> your zookeeper ensemble so they can think fast.

ZooKeeper needs to think fast enough for cluster state/changes. So I think
it scales with the number of machines/collections/shards, and not documents.

> After that your issues are
> complicated by drive I/O which I believe is solved by using shards. If you
> have a collection running on top of a single drive array it should not
> compare to writing to a dozen drive arrays. So a whole bunch of light duty
> machines that have a decent amount of memory and barely able process faster
> than their drive I/O will serve you better.
>
My dataset will be smaller than total memory, so I expect no query to hit
disk.

>
> I think the Apache big data mandate was to be horizontally scalable to
> infinity with cheap consumer hardware. In my minds eye you are not going to
> get crazy input rates without a big horizontal drive system.
>
There is overhead with small machines, and very big machines are pricey.
So something in the middle: either a small cluster of big machines or a big
cluster of small machines.

>
> I'm in the same boat. All the scaling and roll out documentation seems to
> reference the Witch Doctor's secret handbook.
>
> I just started into making my applications ZK aware and really just
> starting to understand the architecture. After a whole year I still feel
> weak while at the same time I have traveled far. I still feel like an
> amateur.
>
> My plans are to use bridge tools in Linux so all my machines are sitting on
> the switch with layer 2. Then use Conga to monitor which apps need to be
> running. If a server dies, it's apps are spun up on one of the other
> servers using the original IP and mac address through a bridge firewall
> gateway so there is no hold up with with mac phreaking like layer 3. Layer
> 3 does not like to see a route change with a mac address. My apps will be
> on a SAN ~ Data on as many shards/machines as financially possible.
>
By conga you mean https://sourceware.org/cluster/conga/spec/ ?
Also SAN may/will suck like someone answered in your thread.

>
> I was going to put a bunch of Apache web servers in round robin to talk to
> Solr but discovered that a Solr node can be dead and not report errors.
>
Please explain more "dead but no error".

> It's all rough at the moment but it makes total sense to send Solr requests
> based on what ZK says is available verses a round robin.
>
Yes, like a commenter wrote on your thread.

>
> Will keep you posted on my roll out if you like.
>
> Best,
>
> GW
>
>
>
>
>
>
>
> On 16 December 2016 at 03:31, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>
> > Hello searchers,
> >
> > I'm researching solr for a project that would require a
> max-inserts(10M/s)
> > and some heavy facet+fq on top of that, though on low qps.
> >
> > And I'm trying to find blogs/slides where people have used some big
> > machines instead of hundreds of small ones.
> >
> > 1. Largest I've found is this
> > <https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-
> > 4-machines-1-solrcloud/>
> > with 16cores + 384GB ram but they were using 25! solr4 instances / server
> > which seems wasteful to me ?
> >
> > I know that 1 solr can have max ~29-30GB heap because GC is
> wasteful/sucks
> > after that, and you should leave the other amount to the os for
> file-cache.
> > 2. But do you think 1 instance will be able to fully-use a 256GB/20core
> > machine ?
> >
> > 3. Like to share your findings/links with big-machine clusters ?
> >
> > Thank You
> >
>


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 11:31 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

> On Fri, 2016-12-16 at 11:19 +0100, Dorian Hoxha wrote:
> > On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen
> > <t...@statsbiblioteket.dk> wrote:
> > > We try hard to stay below 32GB, but for some setups the penalty of
> > > crossing the boundary is worth it. If, for example, having
> > > everything in 1 shard means a heap requirement of 50GB, it can be a
> > > better solution than a multi-shard setup with 2*25GB heap.
> > >
> > The heap is for the instance, not for each shard. Yeah, having less
> > shards is ~more efficient since terms-dictionary,cache etc have lower
> > duplication.
>
> True, but that was not my point. What I tried to communicate is that
> there can be a huge difference between having 1 shard in the collection
> and having more than 1 shard. Not for document searches, but for
> aggregations such as grouping and especially String faceting.
>
> - Toke Eskildsen, State and University Library, Denmark
>
Yes, makes sense. I remember that cross-shard aggs may require more than 1
call (1 call to get the top(x), another to verify that they really are the
top(x) across shards). So fewer shards means fewer merges to get the final values.


Re: Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
On Fri, Dec 16, 2016 at 10:45 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

> On Fri, 2016-12-16 at 09:31 +0100, Dorian Hoxha wrote:
> > I'm researching solr for a project that would require a max-
> > inserts(10M/s) and some heavy facet+fq on top of that, though on low
> > qps.
>
> You don't ask for much, do you :-) If you add high commit rate to the
> list, you have a serious candidate for worst-case.
>
I'm sorry, the commit interval will be 1-2 seconds :( . But this will be expiring
data, so it won't grow to petabytes. I can also relax disk activity. I don't see
a config for relaxing the translog persistence? Like: I write, Solr
returns 'ok', the document is in the translog, but the translog didn't get
an 'ok' from the filesystem yet.

>
> > And I'm trying to find blogs/slides where people have used some big
> > machines instead of hundreds of small ones.
> >
> > 1. Largest I've found is this
> > <https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-4-machines-1-
> > solrcloud/>
> > with 16cores + 384GB ram but they were using 25! solr4 instances /
> > server which seems wasteful to me ?
>
> The way those machines are set up is (nearly) the same as having 16
> quadcore machines with 96GB of RAM, each running 6 Solr instances.
> I say nearly because the shared memory is a plus as it averages
> fluctuations in Solr requirements and a minus because of the cross-
> socket penalties in NUMA.
>
> I digress, sorry. Point is that they are not really run as large
> machines. The choice of box size vs. box count was hugely driven by
> purchase & maintenance cost. Also, as that setup is highly optimized
> towards serving a static index, I don't think it would fit your very
> high update requirements.
>
> As for you argument for less Solrs, each serving multiple shards, then
> it is entirely valid. I have answered your question about this on the
> blog, but the short story is: It works now and optimizing hardware
> utilization is not high on our priority list.
>
> > I know that 1 solr can have max ~29-30GB heap because GC is
> > wasteful/sucks after that, and you should leave the other amount to
> > the os for file-cache.
>
> We try hard to stay below 32GB, but for some setups the penalty of
> crossing the boundary is worth it. If, for example, having everything
> in 1 shard means a heap requirement of 50GB, it can be a better
> solution than a multi-shard setup with 2*25GB heap.
>
The heap is for the instance, not for each shard. Yeah, having fewer shards
is ~more efficient since the terms dictionary, caches, etc. have less duplication.

>
> > 2. But do you think 1 instance will be able to fully-use a
> > 256GB/20core machine ?
>
> I think (you should verify this) that there is some congestion issues
> in the indexing part of Solr: Feeding a single Solr with X threads will
> give you a lower index rate that feeding 2 separate Solrs (running on
> the same machine) with X/2 threads each.
>
That means the thread pools aren't ~very scalable with the number of cores,
assuming we compare 2 shards on 1 Solr vs 2 Solrs, each with 1 shard.

>
> - Toke Eskildsen, State and University Library, Denmark
>
Thanks Toke!


Max vertical scaling in your experience ? (1 instance/server)

2016-12-16 Thread Dorian Hoxha
Hello searchers,

I'm researching solr for a project that would require a max-inserts(10M/s)
and some heavy facet+fq on top of that, though on low qps.

And I'm trying to find blogs/slides where people have used some big
machines instead of hundreds of small ones.

1. Largest I've found is this

with 16cores + 384GB ram but they were using 25! solr4 instances / server
which seems wasteful to me ?

I know that 1 solr can have max ~29-30GB heap because GC is wasteful/sucks
after that, and you should leave the other amount to the os for file-cache.
2. But do you think 1 instance will be able to fully-use a 256GB/20core
machine ?

3. Like to share your findings/links with big-machine clusters ?

Thank You


ttl on merge-time possible somehow ?

2016-12-15 Thread Dorian Hoxha
Hello searchers,

I did some search for TTL on solr, and found only a way to do it with a
delete-query. But that ~sucks, because you have to do a lot of inserts (and
queries).

The other (kinda better) way to do it is to set a collection-level TTL, and
when segments are merged, drop the documents that have expired from
the new merged segment. On the client, I will make sure to do date-range
queries so I don't get back old documents.

So:
1. Is there a way to easily modify the segment merger (or a better way?) to
do that?
2. Is there a way to support this also on get? Looks like I can use
realtime get + a filter query, and it should work based on the documentation.
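
For reference, the client-side date-range filter I have in mind would be
roughly this (a Python sketch; "expire_at" is a hypothetical date field):

    import requests

    params = {
        "q": "*:*",
        "fq": "expire_at:[NOW TO *]",  # only return docs that have not expired yet
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
    print(resp.json()["response"]["numFound"])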

Thank You


Re: Search only for single value of Solr multivalue field

2016-12-15 Thread Dorian Hoxha
You should be able to filter with "(word1 in field OR word2 in field) AND
NOT(word1 in field AND word2 in field)". Translate that into the right
syntax.
I don't know if Lucene is smart enough to execute the filter only once (it
should be, I guess).
Makes sense?
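
Spelled out against the field below (just a literal translation of the boolean
sketch above, untested), it would be something like:

    # hypothetical terms; the "-(...)" clause excludes docs where both terms appear
    fq = ("idx_affilliation:(IFREMER OR Portugal) "
          "-(+idx_affilliation:IFREMER +idx_affilliation:Portugal)")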

On Thu, Dec 15, 2016 at 12:12 PM, Leo BRUVRY-LAGADEC  wrote:

> Hi,
>
> I have a multivalued field in my schema called "idx_affilliation".
>
> IFREMER, Ctr Brest, DRO Geosci Marines,
> F-29280 Plouzane, France.
> Univ Lisbon, Ctr Geofis, P-1269102 Lisbon,
> Portugal.
> Univ Bretagne Occidentale, Inst Univ
> Europeen Mer, Lab Domaines Ocean, F-29280 Plouzane, France.
> Total Explorat Prod Geosci Projets Nouveaux
> Exper, F-92078 Paris, France.
>
> I want to be able to do a query like: idx_affilliation:(IFREMER Portugal)
> and not have this document returned. In other words, I do not want queries
> to span individual values for the field.
>
> 
> ---
>
> Here are some further examples using the document above of how I want this
> to work:
>
> idx_affilliation:(IFREMER France) --> Returns it.
> idx_affilliation:(IFREMER Plouzane) --> Returns it.
> idx_affilliation:("Univ Bretagne Occidentale") --> Returns it.
> idx_affilliation:("Univ Lisbon" Portugal) --> Returns it.
> idx_affilliation:(IFREMER Portugal) --> DOES NOT RETURN IT.
>
> Does someone known if it's possible to do this ?
>
> Best regards,
> Leo.
>


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-15 Thread Dorian Hoxha
See replies inline:

On Wed, Dec 14, 2016 at 3:36 PM, GW <thegeofo...@gmail.com> wrote:

> Thanks,
>
> I understand accessing solr directly. I'm doing REST calls to a single
> machine.
>
> If I have a cluster of five servers and say three Apache servers, I can
> round robin the REST calls to all five in the cluster?
>
I don't know about PHP, but it would be better to have "persistent
connections" or something similar to the Solr servers. In Python, for example, this
is done automatically. It would also be better if each PHP server had a
differently ordered copy of the array of Solr IPs. This way each box will
contact a ~different Solr instance, and will have a better chance of not
creating too many new connections (since the connection cache is per URL/IP).
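
Roughly what I mean, sketched in Python (the host list is made up; in PHP the
equivalent would be a shuffled array plus curl keep-alive):

    import random
    import requests

    # each client process shuffles its own copy of the node list
    SOLR_HOSTS = ["http://solr1:8983", "http://solr2:8983", "http://solr3:8983"]
    random.shuffle(SOLR_HOSTS)

    session = requests.Session()  # keeps connections alive between requests

    def solr_select(params):
        last_err = None
        for host in SOLR_HOSTS:  # fall through to the next node on failure
            try:
                resp = session.get(host + "/solr/mycollection/select",
                                   params=params, timeout=5)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException as err:
                last_err = err
        raise last_err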

>
> I guess I'm going to find out. :-)  If so I might be better off just
> running Apache on all my solr instances.
>
I've done that before (though with ES, but it's ~the same), just contacting
the localhost Solr. The problem with that is that if the Solr on the
current host fails, your PHP won't work. So the best in this scenario is to
have an array of hosts, with the first being the local Solr.

>
>
>
>
>
> On 14 December 2016 at 07:08, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>
> > See replies inline:
> >
> > On Wed, Dec 14, 2016 at 11:16 AM, GW <thegeofo...@gmail.com> wrote:
> >
> > > Hello folks,
> > >
> > > I'm about to set up a Web service I created with PHP/Apache <--> Solr
> > Cloud
> > >
> > > I'm hoping to index a bazillion documents.
> > >
> > ok , how many inserts/second ?
> >
> > >
> > > I'm thinking about using Linode.com because the pricing looks great.
> Any
> > > opinions??
> > >
> > Pricing is 'ok'. For bazillion documents, I would skip vps and go
> straight
> > dedicated. Check out ovh.com / online.net etc etc
> >
> > >
> > > I envision using an Apache/PHP round robin in front of a solr cloud
> > >
> > > My thoughts are that I send my requests to the Solr instances on the
> > > Zookeeper Ensemble. Am I missing something?
> > >
> > You contact with solr directly, don't have to connect to zookeeper for
> > loadbalancing.
> >
> > >
> > > What can I say.. I'm software oriented and a little hardware
> challenged.
> > >
> > > Thanks in advance,
> > >
> > > GW
> > >
> >
>


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-14 Thread Dorian Hoxha
See replies inline:

On Wed, Dec 14, 2016 at 11:16 AM, GW  wrote:

> Hello folks,
>
> I'm about to set up a Web service I created with PHP/Apache <--> Solr Cloud
>
> I'm hoping to index a bazillion documents.
>
ok , how many inserts/second ?

>
> I'm thinking about using Linode.com because the pricing looks great. Any
> opinions??
>
Pricing is 'ok'. For a bazillion documents, I would skip VPSes and go straight
to dedicated. Check out ovh.com / online.net, etc.

>
> I envision using an Apache/PHP round robin in front of a solr cloud
>
> My thoughts are that I send my requests to the Solr instances on the
> Zookeeper Ensemble. Am I missing something?
>
You contact Solr directly; you don't have to connect to ZooKeeper for
load balancing.

>
> What can I say.. I'm software oriented and a little hardware challenged.
>
> Thanks in advance,
>
> GW
>


Re: Difference between currency fieldType and float fieldType

2016-12-07 Thread Dorian Hoxha
Yeah, you always multiply by 100 when you store, query, or facet, and you always
divide by 100 when displaying.
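
A minimal sketch of the cents round-trip (Python; "amount_cents" is a
hypothetical long field):

    def dollars_to_cents(amount):
        # round() guards against float artifacts like 800212.64 -> 80021263.999...
        return int(round(amount * 100))

    def cents_to_dollars(cents):
        return cents / 100.0

    # index: $1234.56 is stored as 123456
    doc = {"id": "order-1", "amount_cents": dollars_to_cents(1234.56)}

    # query: "amounts up to $2000" becomes a range query in cents
    fq = "amount_cents:[* TO %d]" % dollars_to_cents(2000)

    # display: convert back when rendering
    print("$%.2f" % cents_to_dollars(doc["amount_cents"]))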

On Wed, Dec 7, 2016 at 3:07 PM, <esther.quan...@lucidworks.com> wrote:

> I think Edwin might be concerned that in storing it as a long type, there
> will be no distinguishing between, in example, $1234.56 and $123456.
> But correct me if I'm wrong - the latter would be stored as 12345600.
>
> When sending in a search for all values less than $100,000 on a long
> field, will there be a need to send in that value in cents? (That is,
> q=*:*=long_field[* TO 1000] )
>
> Thanks,
>
> Esther Quansah
>
> > Le 7 déc. 2016 à 07:26, Dorian Hoxha <dorian.ho...@gmail.com> a écrit :
> >
> > Come on dude, just use the int/long.
> > Source: double is still a float.
> >
> > On Wed, Dec 7, 2016 at 1:17 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > wrote:
> >
> >> Thanks for the reply.
> >>
> >> How about using the double fieldType?
> >> I tried that it works, as it is 64-bit, as compared to 32-bit for float.
> >> But will it hit the same issue again if the amount exceeds 64-bit?
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >>> On 7 December 2016 at 15:28, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
> >>>
> >>> Yeah, you'll have to do the conversion yourself (or something internal,
> >>> like the currencyField).
> >>>
> >>> Think about it as datetimes. You store everything in utc (cents), but
> >>> display to each user in it's own timezone (different currency, or just
> >> from
> >>> cents to full dollars).
> >>>
> >>> On Wed, Dec 7, 2016 at 8:23 AM, Zheng Lin Edwin Yeo <
> >> edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> But if I index $1234.56 as "123456", won't it affect the search or
> >> facet
> >>> if
> >>>> I do a query directly to Solr?
> >>>>
> >>>> Say if I search for index with amount that is lesser that $2000, it
> >> will
> >>>> not match, unless when we do the search, we have to pass "20" to
> >>> Solr?
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>>
> >>>> On 7 December 2016 at 07:44, Chris Hostetter <
> hossman_luc...@fucit.org
> >>>
> >>>> wrote:
> >>>>
> >>>>> : Thanks for your reply.
> >>>>> :
> >>>>> : That means the best fieldType to use for money is currencyField,
> >> and
> >>>> not
> >>>>> : any other fieldType?
> >>>>>
> >>>>> The primary use case for CurrencyField is when you want to do dynamic
> >>>>> currency fluctuations between multiple currency types at query time
> >> --
> >>>> but
> >>>>> to do that you either need to use the FileExchangeRateProvider and
> >> have
> >>>>> your owne backend system to update the exchange rates, or you have to
> >>>> have
> >>>>> an openexchangerates.org account, or implement some other provider
> >>> (with
> >>>>> custom solr java code)
> >>>>>
> >>>>>
> >>>>> If you only care about a single type of currency -- for example, if
> >> all
> >>>>> you care about is is US Dollars -- then just use either
> >>>>> TrieIntField or TrieLongField and represent in the smallest possible
> >>>>> increment you need to measure -- for US Dollars this would be cents.
> >>> ie:
> >>>>> $1234.56 would be put in your index as "123456"
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Hoss
> >>>>> http://www.lucidworks.com/
> >>>>>
> >>>>
> >>>
> >>
>


Re: Difference between currency fieldType and float fieldType

2016-12-07 Thread Dorian Hoxha
Come on dude, just use the int/long.
Source: double is still a float.

On Wed, Dec 7, 2016 at 1:17 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Thanks for the reply.
>
> How about using the double fieldType?
> I tried that it works, as it is 64-bit, as compared to 32-bit for float.
> But will it hit the same issue again if the amount exceeds 64-bit?
>
> Regards,
> Edwin
>
>
> On 7 December 2016 at 15:28, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>
> > Yeah, you'll have to do the conversion yourself (or something internal,
> > like the currencyField).
> >
> > Think about it as datetimes. You store everything in utc (cents), but
> > display to each user in it's own timezone (different currency, or just
> from
> > cents to full dollars).
> >
> > On Wed, Dec 7, 2016 at 8:23 AM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > wrote:
> >
> > > But if I index $1234.56 as "123456", won't it affect the search or
> facet
> > if
> > > I do a query directly to Solr?
> > >
> > > Say if I search for index with amount that is lesser that $2000, it
> will
> > > not match, unless when we do the search, we have to pass "20" to
> > Solr?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 7 December 2016 at 07:44, Chris Hostetter <hossman_luc...@fucit.org
> >
> > > wrote:
> > >
> > > > : Thanks for your reply.
> > > > :
> > > > : That means the best fieldType to use for money is currencyField,
> and
> > > not
> > > > : any other fieldType?
> > > >
> > > > The primary use case for CurrencyField is when you want to do dynamic
> > > > currency fluctuations between multiple currency types at query time
> --
> > > but
> > > > to do that you either need to use the FileExchangeRateProvider and
> have
> > > > your owne backend system to update the exchange rates, or you have to
> > > have
> > > > an openexchangerates.org account, or implement some other provider
> > (with
> > > > custom solr java code)
> > > >
> > > >
> > > > If you only care about a single type of currency -- for example, if
> all
> > > > you care about is is US Dollars -- then just use either
> > > > TrieIntField or TrieLongField and represent in the smallest possible
> > > > increment you need to measure -- for US Dollars this would be cents.
> > ie:
> > > > $1234.56 would be put in your index as "123456"
> > > >
> > > >
> > > >
> > > > -Hoss
> > > > http://www.lucidworks.com/
> > > >
> > >
> >
>


Re: Difference between currency fieldType and float fieldType

2016-12-06 Thread Dorian Hoxha
Yeah, you'll have to do the conversion yourself (or something internal,
like the currencyField).

Think about it like datetimes: you store everything in UTC (cents), but
display to each user in their own timezone (different currency, or just converting
from cents to full dollars).

On Wed, Dec 7, 2016 at 8:23 AM, Zheng Lin Edwin Yeo 
wrote:

> But if I index $1234.56 as "123456", won't it affect the search or facet if
> I do a query directly to Solr?
>
> Say if I search for index with amount that is lesser that $2000, it will
> not match, unless when we do the search, we have to pass "20" to Solr?
>
> Regards,
> Edwin
>
>
> On 7 December 2016 at 07:44, Chris Hostetter 
> wrote:
>
> > : Thanks for your reply.
> > :
> > : That means the best fieldType to use for money is currencyField, and
> not
> > : any other fieldType?
> >
> > The primary use case for CurrencyField is when you want to do dynamic
> > currency fluctuations between multiple currency types at query time --
> but
> > to do that you either need to use the FileExchangeRateProvider and have
> > your owne backend system to update the exchange rates, or you have to
> have
> > an openexchangerates.org account, or implement some other provider (with
> > custom solr java code)
> >
> >
> > If you only care about a single type of currency -- for example, if all
> > you care about is is US Dollars -- then just use either
> > TrieIntField or TrieLongField and represent in the smallest possible
> > increment you need to measure -- for US Dollars this would be cents. ie:
> > $1234.56 would be put in your index as "123456"
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>


Re: Difference between currency fieldType and float fieldType

2016-12-06 Thread Dorian Hoxha
Don't use float for money (in whatever db).
https://wiki.apache.org/solr/CurrencyField
What you do is save the money as cents, and store that in a long. That's
what the currencyField probably does for you inside.
It provides currency conversion at query-time.


On Tue, Dec 6, 2016 at 4:45 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> Would like to understand better between the currency fieldType and float
> fieldType.
>
> If I were to index a field that is a currency field by nature (Eg: amount)
> into Solr, is it better to use the currency fieldType as compared to the
> float fieldType?
>
> I found that for the float fieldType, if the amount is very big, the last
> decimal place may get cut off in the index. For example, if the amount in
> the original document is 800212.64, the number that is indexed in Solr is
> 800212.6.
>
> Although by using the currency fieldType will solve this issue, but however
> I found that I am not able to do faceting on currency fieldType. I will
> need to have the facet so that I can list out the various amount that are
> available based on the search criteria.
>
> As such, will like to seek your recommendation to determine which fieldType
> is best for my needs.
>
> I'm using Solr 6.2.1
>
> Regards,
> Edwin
>


Re: Queries regarding solr cache

2016-12-01 Thread Dorian Hoxha
@Shawn
Any idea why the cache doesn't use roaring bitsets ?

On Thu, Dec 1, 2016 at 3:49 PM, Shawn Heisey  wrote:

> On 12/1/2016 4:04 AM, kshitij tyagi wrote:
> > I am using Solr and serving huge number of requests in my application.
> >
> > I need to know how can I utilize caching in Solr.
> >
> > As of now in  then clicking Core Selector → [core name] → Plugins /
> Stats.
> >
> > I am seeing my hit ration as 0 for all the caches. What does this mean
> and
> > how this can be optimized.
>
> If your hitratio is zero, then none of the queries related to that cache
> are finding matches.  This means that your client systems are never
> sending the same query twice.
>
> One possible reason for a zero hitratio is using "NOW" in date queries
> -- NOW changes every millisecond, and the actual timestamp value is what
> ends up in the cache.  This means that the same query with NOW executed
> more than once will actually be different from the cache's perspective.
> The solution is date rounding -- using things like NOW/HOUR or NOW/DAY.
> You could use NOW/MINUTE, but the window for caching would be quite small.
>
> 5000 entries for your filterCache is almost certainly too big.  Each
> filterCache entry tends to be quite large.  If the core has ten million
> documents in it, then each filterCache entry would be 1.25 million bytes
> in size -- the entry is a bitset of all documents in the core.  This
> includes deleted docs that have not yet been reclaimed by merging.  If a
> filterCache for an index that size (which is not all that big) were to
> actually fill up with 5000 entries, it would require over six gigabytes
> of memory just for the cache.
>
> The 1000 that you have on queryResultCache is also rather large, but
> probably not a problem.  There's also documentCache, which generally is
> OK to have sized at several thousand -- I have 16384 on mine.  If your
> documents are particularly large, then you probably would want to have a
> smaller number.
>
> It's good that your autowarmCount values are low.  High values here tend
> to make commits take a very long time.
>
> You do not need to send your message more than once.  The first repeat
> was after less than 40 minutes.  The second was after about two hours.
> Waiting a day or two for a response, particularly for a difficult
> problem, is not unusual for a mailing list.  I begain this reply as soon
> as I saw your message -- about 7:30 AM in my timezone.
>
> Thanks,
> Shawn
>
>


Realtime multi get with different (_route_, fields, etc) for each id

2016-11-30 Thread Dorian Hoxha
Hello searchers,

Looks like this is not possible, right ?
It means I have to specify all the _route_ values in the request, and each shard
will try to look up all the ids internally.
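
For context, what I do today looks roughly like this (Python sketch; the ids
and routes are made up, and the routes are passed once for the whole request
rather than per id):

    import requests

    params = {
        "ids": "tenantA!doc1,tenantB!doc7",
        "_route_": "tenantA!,tenantB!",  # one global list of routes for all ids
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/mycollection/get", params=params)
    print(resp.json())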

Is there a way to specify it per id?

Like elasticsearch does
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-get.html#mget-routing

Thank You


Re: Index time sorting and per index mergePolicyFactory

2016-11-28 Thread Dorian Hoxha
bump after 11 days

On Thu, Nov 17, 2016 at 10:25 AM, Dorian Hoxha <dorian.ho...@gmail.com>
wrote:

> Hi,
>
> I know this is done in lucene, but I don't see it in solr (by searching +
> docs on collections).
>
> I see https://cwiki.apache.org/confluence/display/solr/
> IndexConfig+in+SolrConfig but it's not mentioned for index-time-sorting.
>
> So, is it possible and definable for each index ? I want to have some
> sorted by 'x' field, some by 'y' field, and some staying as default.
>
> Thank You
>


Re: update a document without changing anything

2016-11-27 Thread Dorian Hoxha
Thanks Ishan.
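
For reference, the "inc by 0" trick Ishan suggests below would look roughly
like this (Python sketch; "popularity" is a hypothetical numeric field that
already exists on the docs):

    import requests

    # atomic update: "inc" a numeric field by 0, which rewrites the whole
    # (fully stored) document in place without changing any values
    docs = [{"id": "doc-1", "popularity": {"inc": 0}}]
    resp = requests.post(
        "http://localhost:8983/solr/mycollection/update?commit=true",
        json=docs,
    )
    print(resp.status_code)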

On Sun, Nov 27, 2016 at 6:42 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Maybe do an "inc" of 0 to a numeric field for every document.
> https://cwiki.apache.org/confluence/display/solr/
> Updating+Parts+of+Documents
>
> On Wed, Nov 23, 2016 at 2:13 PM, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
>
> > Hello searcherers,
> >
> > So, I have document that is fully stored. Then I make small change in
> > schema. And now I have to reinsert every document. But I'm afraid of
> doing
> > a get+insert, because something else may change the document in the
> > meantime. So I want to do an "update" of nothing, so internally on the
> > master-shard, the document is updated without changes. Maybe an update
> with
> > no modifiers ?
> >
> > Thank You!
> >
>


Re: Best python 3 client for solrcloud

2016-11-24 Thread Dorian Hoxha
Hi Nick,

What I care about most is the low-level stuff working well (cloud support, retries,
ZooKeeper (I don't think that's needed for normal requests), maybe even
routing to the right core/replica?).
And your client looked the best at a glance.

On Thu, Nov 24, 2016 at 10:07 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:

> I am a comitter for
>
> https://github.com/moonlitesolutions/SolrClient.
>
> I think its pretty good, my aim with it is to provide several reusable
> modules for working with Solr in python. Not just querying, but working
> with collections indexing, reindexing, etc..
>
> Check it out and let me know what you think.
>
> On Nov 24, 2016 3:51 PM, "Dorian Hoxha" <dorian.ho...@gmail.com> wrote:
>
> > Hi searchers,
> >
> > I see multiple clients for solr in python but each one looks like misses
> > many features. What I need is for at least the low-level api to work with
> > cloud (like retries on different nodes and nice exceptions). What is the
> > best that you use currently ?
> >
> > Thank You!
> >
>


Best python 3 client for solrcloud

2016-11-24 Thread Dorian Hoxha
Hi searchers,

I see multiple clients for Solr in Python, but each one looks like it misses
many features. What I need is for at least the low-level API to work with
cloud (like retries on different nodes and nice exceptions). What is the
best one that you use currently?

Thank You!


Re: Should zookeeper be run on the worker machines?

2016-11-23 Thread Dorian Hoxha
You can, but you should not. Source: heavy load may slow ZooKeeper,
resulting in timeouts, etc.

On Wed, Nov 23, 2016 at 5:00 PM, Tech Id  wrote:

> Hi,
>
> Can someone please respond to this zookeeper-for-Solr Stack-Overflow
> question: http://stackoverflow.com/questions/40755137/should-
> zookeeper-be-run-on-the-worker-machines
>
> Thanks
> TI
>


update a document without changing anything

2016-11-23 Thread Dorian Hoxha
Hello searchers,

So, I have documents that are fully stored. Then I make a small change in the
schema. And now I have to reinsert every document. But I'm afraid of doing
a get+insert, because something else may change the document in the
meantime. So I want to do an "update" of nothing, so that internally on the
master shard the document is updated without changes. Maybe an update with
no modifiers?

Thank You!


Re: Solr/lucene "planet" + recommendations for blogs to follow

2016-11-23 Thread Dorian Hoxha
That's why I mentioned the sponsoring.
Another thing that's missing is a list of plugins/extensions. How do you find
those? I've seen solr.cool, but I thought there would be more; it looks kinda
incomplete.

On Tue, Nov 22, 2016 at 12:56 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> I tried weekly. I did not have personal bandwidth for that. It
> actually takes quite a lot of time to do the newsletter, especially
> since I also try to update the website (a separate messy/hacky story).
> And since English is not my first language and writing short copy is
> harder than a long one :-)
>
> The curation project would obviously help once I get to it, as the
> same material would contribute to both sources, just in different
> volumes.
>
> Thanks for bug report. The screenshot does not make it through to the
> public (as this thread is) mailing list, but I'll figure it out. I
> have enough info.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 November 2016 at 22:45, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > Thanks Alex, some kind of weekly newsletter would be great (examples I
> > subscribe to are db weekly, postgresql weekly, redis weekly).
> >
> > If it makes sense, to make it weekly, add some sponsor(targeted) to it,
> and
> > it should be nicer. Maybe even include es,lucene if there's not enough
> > content or there's interest.
> >
> > A small bug on your site, the twitter widget is on top of the sign-up
> form
> > (maybe only happens on small resolutions, happened on fullscreen for me).
> > See attached screenshot.
> >
> > On Tue, Nov 22, 2016 at 12:22 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >>
> >> I am not aware of any aggregator like that. And I looked, hard.
> >>
> >> I, myself, publish a newsletter (Solr Start, in signature, every 2
> >> weeks) that usually has a couple of links to cool Solr stuff I found.
> >> Subscribing to newsletter also gives access to full archives...
> >>
> >> To find the links, I have a bunch of ad-hoc keyword trackers installed
> >> for that. Just basic hacks for now.
> >>
> >> I am also _thinking_ of creating an aggregator. But not so much the
> >> planet style as a Yahoo-directory/open-directory style. For which
> >> (Yahoo style directory curation and generation), I cannot seem to find
> >> a good software package either. So, I may build one from scratch.
> >> Probably just as hacky, just because my skills are not universal. A
> >> hacky version will probably look like Twitter keyword scanner with URL
> >> deduplication, fully manual curation and Wordpress as a publishing
> >> platform.
> >>
> >> But if anybody is interesting in helping with building a proper
> >> open-source one as a small big-data pipeline (in Java), give me a
> >> yell. The non-hacky system will probably need to put together a
> >> crawler (twitter, websites, etc), a graph database, possibly some
> >> analyzer/reducer/aggregator, manual/ML curator/tagger, and (in my
> >> mind) static site builder with Solr (duh!) as a search backend. I have
> >> a lot more design thoughts of course, but the list is not the right
> >> place for multi-page idea dump :-) And I am happy to read anybody
> >> else's idea dumps on this concept, sent off-the-list.
> >>
> >> As to "what's happening" - subscribing to JIRA list and filtering out
> >> issue notifications is probably a reasonable way to see what work is
> >> going on. I have filters that specifically catch CREATE issue emails.
> >> I also review release notes in details. That keeps me up to date with
> >> new stuff. Older stuff or in-depth explanations of new stuff is -
> >> unfortunately - all over the place, so it is hard to give a short list
> >> of things to follow. Of course, Lucidworks blog seems to be pretty
> >> active: https://lucidworks.com/blog/
> >>
> >> Regards,
> >>Alex.
> >>
> >> 
> >> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >>
> >>
> >> On 22 November 2016 at 21:56, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
> >> > Hello searcherers,
> >> >
> >> > Is there a solr/lucene "planet" like planet.postgresql.org ? If not,
> >> > what
> >> > are some blogs/rss/feeds that I should follow to learn what's
> happening
> >> > in
> >> > the solr/lucene worlds ?
> >> >
> >> > Thank You
> >
> >
>


Re: Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Dorian Hoxha
Thanks Alex, some kind of weekly newsletter would be great (examples I
subscribe to are db weekly, postgresql weekly, redis weekly).

If it makes sense to make it weekly, add a (targeted) sponsor to it, and
it should be nicer. Maybe even include ES/Lucene if there's not enough
content or there's interest.

A small bug on your site: the Twitter widget is on top of the sign-up form
(maybe it only happens at small resolutions, but it happened at fullscreen for me).
See attached screenshot.

On Tue, Nov 22, 2016 at 12:22 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> I am not aware of any aggregator like that. And I looked, hard.
>
> I, myself, publish a newsletter (Solr Start, in signature, every 2
> weeks) that usually has a couple of links to cool Solr stuff I found.
> Subscribing to newsletter also gives access to full archives...
>
> To find the links, I have a bunch of ad-hoc keyword trackers installed
> for that. Just basic hacks for now.
>
> I am also _thinking_ of creating an aggregator. But not so much the
> planet style as a Yahoo-directory/open-directory style. For which
> (Yahoo style directory curation and generation), I cannot seem to find
> a good software package either. So, I may build one from scratch.
> Probably just as hacky, just because my skills are not universal. A
> hacky version will probably look like Twitter keyword scanner with URL
> deduplication, fully manual curation and Wordpress as a publishing
> platform.
>
> But if anybody is interesting in helping with building a proper
> open-source one as a small big-data pipeline (in Java), give me a
> yell. The non-hacky system will probably need to put together a
> crawler (twitter, websites, etc), a graph database, possibly some
> analyzer/reducer/aggregator, manual/ML curator/tagger, and (in my
> mind) static site builder with Solr (duh!) as a search backend. I have
> a lot more design thoughts of course, but the list is not the right
> place for multi-page idea dump :-) And I am happy to read anybody
> else's idea dumps on this concept, sent off-the-list.
>
> As to "what's happening" - subscribing to JIRA list and filtering out
> issue notifications is probably a reasonable way to see what work is
> going on. I have filters that specifically catch CREATE issue emails.
> I also review release notes in details. That keeps me up to date with
> new stuff. Older stuff or in-depth explanations of new stuff is -
> unfortunately - all over the place, so it is hard to give a short list
> of things to follow. Of course, Lucidworks blog seems to be pretty
> active: https://lucidworks.com/blog/
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 November 2016 at 21:56, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > Hello searcherers,
> >
> > Is there a solr/lucene "planet" like planet.postgresql.org ? If not,
> what
> > are some blogs/rss/feeds that I should follow to learn what's happening
> in
> > the solr/lucene worlds ?
> >
> > Thank You
>


Re: Multiple search-queries in 1 http request ?

2016-11-22 Thread Dorian Hoxha
@Alex
Yes, that should also support more efficient serialization(binary) like
msgpack etc.

On Tue, Nov 22, 2016 at 1:33 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> HTTP 2 and whatever that Google's new protocol is are both into
> pipelining over the same connection (HTTP 1.1 too, but not as well).
> So, I feel, the right approach would be instead to check whether
> SolrJ/Jetty can handle those and not worry about it within Solr
> itself.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 November 2016 at 04:58, Walter Underwood <wun...@wunderwood.org>
> wrote:
> > A agree that dispatching multiple queries is better.
> >
> > With multiple queries, we need to deal with multiple result codes,
> multiple timeouts, and so on. Then write tests for all that stuff.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 21, 2016, at 9:55 AM, Christian Ortner <chris.ort...@gmail.com>
> wrote:
> >>
> >> Hi,
> >>
> >> there has been an JIRA issue[0] for a long time that contains some
> patches
> >> for multiple releases of Solr that implement this functionality. It's a
> >> different topic if those patches still work in recent versions, and the
> >> issue has been resolved as a won't fix.
> >>
> >> Personally, I think starting multiple queries asynchronously right after
> >> each other has little disadvantages over a batching mechanism.
> >>
> >> Best regards,
> >> Chris
> >>
> >>
> >> [0] https://issues.apache.org/jira/browse/SOLR-1093
> >>
> >> On Thu, Nov 17, 2016 at 7:50 PM, Mikhail Khludnev <m...@apache.org>
> wrote:
> >>
> >>> Hello,
> >>> There is nothing like that in Solr.
> >>>
> >>> On Thursday, November 17, 2016, Dorian Hoxha <dorian.ho...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I couldn't find anything in core for "multiple separate queries in 1
> http
> >>>> request" like elasticsearch
> >>>> <https://www.elastic.co/guide/en/elasticsearch/reference/
> >>>> current/search-multi-search.html>
> >>>> ? I found this
> >>>> <http://blog.nbostech.com/2015/09/solr-search-conponent-
> >>> multiple-queries/>
> >>>> blog-post though I thought there is/should/would be something in core
> ?
> >>>>
> >>>> Thank You
> >>>>
> >>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>>
> >
>


Solr/lucene "planet" + recommendations for blogs to follow

2016-11-22 Thread Dorian Hoxha
Hello searchers,

Is there a solr/lucene "planet" like planet.postgresql.org ? If not, what
are some blogs/rss/feeds that I should follow to learn what's happening in
the solr/lucene worlds ?

Thank You


Re: Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-22 Thread Dorian Hoxha
Yeah that looks like the _source that elasticsearch has.

On Mon, Nov 21, 2016 at 9:20 PM, Michael Joyner <mich...@newsrx.com> wrote:

> Have a "store only" text field that contains a serialized (json?) of the
> master object for deserilization as part of the results parsing if you are
> wanting to save a DB lookup.
>
> I would still store everything in a DB though to have a "master" copy of
> everthing.
>
>
>
> On 11/18/2016 04:45 AM, Dorian Hoxha wrote:
>
>> @alex
>> That makes sense, but it can be ~fixed by just storing every field that
>> you
>> need.
>>
>> @Walter
>> Many of those things are missing from many nosql dbs yet they're used as
>> source of data.
>> As long as the backup is "point in time", meaning consistent timestamp
>> across all shards it ~should be ok for many usecases.
>>
>> The 1-line-curl may need a patch to be disabled from config.
>>
>> On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>> I agree, it is a bad idea.
>>>
>>> Solr is missing nearly everything you want in a repository, because it is
>>> not designed to be a repository.
>>>
>>> Does not have:
>>>
>>> * access control
>>> * transactions
>>> * transactional backup
>>> * dump and load
>>> * schema migration
>>> * versioning
>>>
>>> And so on.
>>>
>>> Also, I’m glad to share a one-line curl command that will delete all the
>>> documents
>>> in your collection.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>> On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch <arafa...@gmail.com>
>>>>
>>> wrote:
>>>
>>>> I've heard of people doing it but it is not recommended.
>>>>
>>>> One of the biggest implementation breakthroughs is that - after the
>>>> initial learning curve - you will start mapping your input data to
>>>> signals. Those signals will not look very much like your original data
>>>> and therefore are not terribly suitable to be the source of it.
>>>>
>>>> We are talking copyFields, UpdateRequestProcessor pre-processing,
>>>> fields that are not stored, nested documents flattening,
>>>> denormalization, etc. Getting back from that to original shape of data
>>>> is painful.
>>>>
>>>> Regards,
>>>>Alex.
>>>> 
>>>> Solr Example reading group is starting November 2016, join us at
>>>> http://j.mp/SolrERG
>>>> Newsletter and resources for Solr beginners and intermediates:
>>>> http://www.solr-start.com/
>>>>
>>>>
>>>> On 17 November 2016 at 18:46, Dorian Hoxha <dorian.ho...@gmail.com>
>>>>
>>> wrote:
>>>
>>>> Hi,
>>>>>
>>>>> Anyone use solr for source-of-data with no `normal` db (of course with
>>>>> normal backups/replication) ?
>>>>>
>>>>> Are there any drawbacks ?
>>>>>
>>>>> Thank You
>>>>>
>>>>
>>>
>


Re: Bkd tree numbers/geo on solr 6.3 ?

2016-11-18 Thread Dorian Hoxha
Looks like it needs https://issues.apache.org/jira/browse/SOLR-8396 .

On Thu, Nov 17, 2016 at 2:41 PM, Dorian Hoxha <dorian.ho...@gmail.com>
wrote:

> Hi,
>
> I've read that lucene 6 has fancy bkd-tree implementation for numbers. But
> on latest cwiki I only see TrieNumbers. Aren't they implemented or did I
> miss something (they still mention "indexing multiple values for
> range-queries" , which is the old way)?
>
> Thank You
>


Re: Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-18 Thread Dorian Hoxha
@alex
That makes sense, but it can be ~fixed by just storing every field that you
need.

@Walter
Many of those things are missing from many NoSQL DBs, yet they're used as
the source of data.
As long as the backup is "point in time", meaning a consistent timestamp
across all shards, it ~should be OK for many use cases.

The 1-line curl may need a patch so that it can be disabled from config.

On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> I agree, it is a bad idea.
>
> Solr is missing nearly everything you want in a repository, because it is
> not designed to be a repository.
>
> Does not have:
>
> * access control
> * transactions
> * transactional backup
> * dump and load
> * schema migration
> * versioning
>
> And so on.
>
> Also, I’m glad to share a one-line curl command that will delete all the
> documents
> in your collection.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
> >
> > I've heard of people doing it but it is not recommended.
> >
> > One of the biggest implementation breakthroughs is that - after the
> > initial learning curve - you will start mapping your input data to
> > signals. Those signals will not look very much like your original data
> > and therefore are not terribly suitable to be the source of it.
> >
> > We are talking copyFields, UpdateRequestProcessor pre-processing,
> > fields that are not stored, nested documents flattening,
> > denormalization, etc. Getting back from that to original shape of data
> > is painful.
> >
> > Regards,
> >   Alex.
> > 
> > Solr Example reading group is starting November 2016, join us at
> > http://j.mp/SolrERG
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 17 November 2016 at 18:46, Dorian Hoxha <dorian.ho...@gmail.com>
> wrote:
> >> Hi,
> >>
> >> Anyone use solr for source-of-data with no `normal` db (of course with
> >> normal backups/replication) ?
> >>
> >> Are there any drawbacks ?
> >>
> >> Thank You
>
>


Changing route and still ending on the same shard (or increasing the % of shards a tenant is distributed on without reindexing)

2016-11-17 Thread Dorian Hoxha
Hi,

Assuming I use `tenant1/4!doc50` for the id (which means 1/16th of the shards), and
I later change it to `tenant1/2!doc50` (which means 1/8), is it guaranteed
that the document will go to the same shard? (It would be nice, but I
don't think so.) Meaning, when you change the `/x!`, do you have to
reindex all data for that tenant? (If not, is there a way without fully
reindexing the tenant?) This would probably also fail because the id changes,
but what if another field is used for routing and the id stays the same?
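
To make the question concrete, here is a rough sketch of what I understand the
compositeId-style bit splitting to do. This is hypothetical illustration code,
not Solr's actual CompositeIdRouter (which uses MurmurHash3 internally):

public class CompositeHashSketch {
    // Take the top `prefixBits` bits of the hash from the tenant prefix and
    // the remaining bits from the doc id (Solr's default for a plain
    // `prefix!id` is 16 bits from each).
    static int compositeHash(String prefix, String id, int prefixBits) {
        int prefixHash = prefix.hashCode(); // stand-in for Solr's MurmurHash3
        int idHash = id.hashCode();
        int mask = (prefixBits == 0) ? 0 : (-1 << (32 - prefixBits)); // top bits
        return (prefixHash & mask) | (idHash & ~mask);
    }

    public static void main(String[] args) {
        // `/4` and `/2` keep a different number of doc-id bits, so the combined
        // hash usually differs and the document is not guaranteed to land in
        // the same shard's hash range after the change.
        System.out.println(Integer.toHexString(compositeHash("tenant1", "doc50", 4)));
        System.out.println(Integer.toHexString(compositeHash("tenant1", "doc50", 2)));
    }
}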

Thank You


Bkd tree numbers/geo on solr 6.3 ?

2016-11-17 Thread Dorian Hoxha
Hi,

I've read that lucene 6 has a fancy bkd-tree implementation for numbers. But
on the latest cwiki I only see TrieNumbers. Aren't they implemented or did I
miss something (they still mention "indexing multiple values for
range-queries", which is the old way)?

Thank You


Multiple search-queries in 1 http request ?

2016-11-17 Thread Dorian Hoxha
Hi,

I couldn't find anything in core for "multiple separate queries in 1 http
request" like elasticsearch has? I found a blog post describing a way to do
it, though I thought there is/should/would be something in core?
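
For now the closest workaround I can think of is firing the queries in parallel
from the client side. A rough SolrJ sketch (endpoint, collection and queries are
made up); each query is of course still its own HTTP request:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParallelQueriesSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and collection; adjust to your setup.
        SolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
        List<String> queries = Arrays.asList("title:foo", "body:bar", "tags:baz");

        // Submit every query on its own thread and collect the futures.
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        List<Future<QueryResponse>> futures = new ArrayList<>();
        for (String q : queries) {
            Callable<QueryResponse> task = () -> client.query(new SolrQuery(q));
            futures.add(pool.submit(task));
        }
        for (Future<QueryResponse> f : futures) {
            System.out.println(f.get().getResults().getNumFound());
        }
        pool.shutdown();
        client.close();
    }
}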

Thank You


Updating documents with docvalues (not stored), commit question

2016-11-17 Thread Dorian Hoxha
Looks like you can update documents even when fields only have docValues (without
stored). While I understand the columnar format, my issue with this is that
docValues are added when a 'commit' is done (right?). Does that mean that
updating with docValues will force a commit (which is a slow operation), or
does it do something smarter?
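
For reference, this is the kind of update I mean. A minimal SolrJ sketch with
made-up endpoint, collection and field names (the "views" field would be
docValues only, not stored):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        // Atomic update: increment the "views" field of document "doc1" by 1.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("views", Collections.singletonMap("inc", 1));
        client.add(doc);

        // The new value only becomes searchable after a (soft or hard) commit,
        // whether from autoCommit/autoSoftCommit or an explicit one like this.
        client.commit();
        client.close();
    }
}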

Thank You


Re: Parent child relationship, where children aren't nested but separate (like elasticsearch)

2016-11-17 Thread Dorian Hoxha
It's not mentioned on that page, but I'm assuming the join should work on
SolrCloud when joining the same collection with the same routing (example:
users and user_events both routed by user_id, and joining on user_id).
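
For the record, this is the kind of query I mean. A SolrJ sketch with made-up
doc_type/user_id/action field names, assuming both document types live in the
same collection and are co-located by routing (the plain join only sees
documents in the same core, which is why the shared routing matters):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinQuerySketch {
    public static void main(String[] args) throws Exception {
        SolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        // Join query parser: find event docs matching the inner query, then
        // return the documents whose user_id matches those events' user_id.
        SolrQuery q = new SolrQuery(
            "{!join from=user_id to=user_id}doc_type:event AND action:click");
        q.addFilterQuery("doc_type:user"); // keep only the user documents

        QueryResponse rsp = client.query(q);
        System.out.println("matching users: " + rsp.getResults().getNumFound());
        client.close();
    }
}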


On Thu, Nov 17, 2016 at 10:23 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> You want just the usual join (not the block-join). That's the way it
> was before nested documents became supported.
> https://cwiki.apache.org/confluence/display/solr/Other+
> Parsers#OtherParsers-JoinQueryParser
>
> Also, Elasticsearch - as far as I remember - stores the original
> document structure (including children) as a special field and then
> flattens all the children into parallel fields within parent. Which
> causes interesting hidden ranking issues, but that's an issue for a
> different day.
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 17 November 2016 at 18:08, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > Hi,
> >
> > I'm not finding a way to support parent-child like es does (using
> > blockjoin)? I've seen some blogs
> > <http://www.slideshare.net/anshumg/working-with-deeply-
> nested-documents-in-apache-solr>
> > with having children as nested inside the parent-document, but I want to
> > freely crud childs/parents as separate documents (i know that nested also
> > writes separate documents) and have a special field to link them +
> manually
> > route them to the same shard.
> >
> > Is this possible/available ?
> >
> > Thank You
>


Re: "add and limit" update modifier or scripted update like elasticsearch

2016-11-17 Thread Dorian Hoxha
Hi Alex,

Yes, I saw the update modifiers, but there isn't an add-and-limit() thing.
The update request processors should work.
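
If nothing built-in exists, I guess a custom processor along these lines could
do it. A rough, untested sketch with hypothetical class and field names, meant
to sit after DistributedUpdateProcessor so it sees the fully merged document:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class AddAndLimitProcessorFactory extends UpdateRequestProcessorFactory {
    private static final String FIELD = "recent_events"; // hypothetical field
    private static final int MAX_VALUES = 10;

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                if (doc.getFieldValues(FIELD) != null) {
                    List<Object> values = new ArrayList<>(doc.getFieldValues(FIELD));
                    // Keep only the newest MAX_VALUES entries of the field.
                    if (values.size() > MAX_VALUES) {
                        doc.setField(FIELD,
                            values.subList(values.size() - MAX_VALUES, values.size()));
                    }
                }
                super.processAdd(cmd);
            }
        };
    }
}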

Thanks

On Thu, Nov 17, 2016 at 10:26 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Solr has an partial update support, though you need to be careful to
> have all fields retrievable (stored or docvalue).
> https://cwiki.apache.org/confluence/display/solr/
> Updating+Parts+of+Documents
>
> Solr also has UpdateRequestProcessor which can do many things,
> including scripting.
> https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors
> I believe, you would need to place it AFTER DistributedUpdateProcessor
> if you want to apply it on the whole reconstructed "updated" document
> as opposed to just on changes sent.
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 17 November 2016 at 18:06, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> > Hi,
> >
> > Is there an "add and limit" update modifier (couldn't find in docs) ? If
> > not, can I run a script to update a document (still couldn't find
> anything)
> > ? If not, how should I do that  (custom plugin? )?
> >
> > Thank You
>


Index time sorting and per index mergePolicyFactory

2016-11-17 Thread Dorian Hoxha
Hi,

I know this is done in Lucene, but I don't see it in Solr (searching the
docs on collections didn't turn up anything).

I see
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
but index-time sorting isn't mentioned there.

So, is it possible and definable per index? I want to have some collections
sorted by the 'x' field, some by 'y', and some staying at the default.
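
At the Lucene level I mean roughly this (just a sketch of the Lucene 6.x API;
how to express it per core in solrconfig.xml is exactly what I'm asking about):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class IndexSortSketch {
    public static void main(String[] args) {
        // Index-time sorting in Lucene 6.x: segments are written sorted by 'x'.
        // The 'x' field needs numeric doc values for this to work.
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setIndexSort(new Sort(new SortField("x", SortField.Type.LONG)));
        System.out.println(iwc.getIndexSort());
    }
}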

Thank You


Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-16 Thread Dorian Hoxha
Hi,

Anyone use solr for source-of-data with no `normal` db (of course with
normal backups/replication) ?

Are there any drawbacks ?

Thank You


How many versions do you stay behind in production for better stability ?

2016-11-16 Thread Dorian Hoxha
Hi,

I see that there is a new Solr release with every Lucene release. Do you
always use the latest version, given that it may have bugs (e.g. most
Cassandra production deployments run older versions than the latest
`stable` one because the newest releases aren't considered stable yet)?
How far behind do you usually stay? (e.g. 6.3 just came out and you need to
be in production in 1 month; would you upgrade on dev if you don't need any
new feature?)

Thank You


Parent child relationship, where children aren't nested but separate (like elasticsearch)

2016-11-16 Thread Dorian Hoxha
Hi,

I'm not finding a way to support parent-child like es does (using
blockjoin)? I've seen some blogs with children nested inside the parent
document, but I want to freely crud children/parents as separate documents
(I know that nested also writes separate documents) and have a special
field to link them + manually route them to the same shard.

Is this possible/available ?

Thank You


"add and limit" update modifier or scripted update like elasticsearch

2016-11-16 Thread Dorian Hoxha
Hi,

Is there an "add and limit" update modifier (couldn't find in docs) ? If
not, can I run a script to update a document (still couldn't find anything)
? If not, how should I do that  (custom plugin? )?

Thank You


Re: book for Solr 3.4?

2016-11-16 Thread Dorian Hoxha
@HelponR
Curious why you're interested in an old version?

On Tue, Nov 15, 2016 at 11:43 PM, HelponR  wrote:

> Thank you. Just found one here https://wiki.apache.org/solr/SolrResources
>
> "Apache Solr 3 Enterprise Search Server
>  by David Smiley and Eric
> Pugh. This is the 2nd edition of the first book, published by Packt.
> Essential reading for developers, this book covers nearly every feature up
> thru Solr 3.4. "
>
>
> On Tue, Nov 15, 2016 at 2:15 PM, Deeksha Sharma 
> wrote:
>
> > BTW its Apache Solr 4 Cookbook
> > 
> > From: Deeksha Sharma 
> > Sent: Tuesday, November 15, 2016 2:06 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: book for Solr 3.4?
> >
> > Apache solr cookbook will definitely help you get started. This is in
> > addition to the Apache Solr official documentation.
> >
> >
> > Thanks
> > Deeksha
> > 
> > From: HelponR 
> > Sent: Tuesday, November 15, 2016 2:03 PM
> > To: solr-user@lucene.apache.org
> > Subject: book for Solr 3.4?
> >
> > Hello!
> >
> > Is there a good book for Solr 3.4? The "Solr in Action" is for 4.4.
> >
> > googling did not help:(
> >
> > Thanks!
> >
>