Re: SOLR upgrade

2021-02-12 Thread David Hastings
I generally only upgrade every other release. Since I started with 1.4, I
went 3 -> 5 -> 7.x, and never EVER to a .0 release or an even-numbered major release.

On Fri, Feb 12, 2021 at 12:01 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Just avoid 8.8.0 for the moment, until 8.8.1 is released. 8.7.x should be
> fine.
>
> On Fri, Feb 12, 2021 at 10:28 PM Alessandro Benedetti <
> a.benede...@sease.io>
> wrote:
>
> > Hi,
> > following up on Charlie's detailed response, I would recommend carefully
> > assessing the code you are using to interact with Apache Solr (on top of
> > the Solr changes themselves).
> > Assuming you are using some sort of client, it's extremely important to
> > fully understand both the syntax and the semantics of each call.
> > I have seen a lot of "compiles OK" search-API migrations that were fine
> > syntactically but a disaster from the semantic perspective (missing
> > important parameters etc.).
> >
> > If you have plugins to maintain, this is even more complicated than
> > just making them compile.
> >
> > Regards
> > --
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> > www.sease.io
> >
> >
> > On Tue, 9 Feb 2021 at 11:01, Charlie Hull <
> ch...@opensourceconnections.com
> > >
> > wrote:
> >
> > > Hi Lulu,
> > >
> > > I'm afraid you're going to have to recognise that Solr 5.2.1 is very
> > > out-of-date and the changes between this version and the current 8.x
> > > releases are significant. A direct jump is I think the only sensible
> > > option.
> > >
> > > Although you could take the current configuration and attempt to
> upgrade
> > > it to work with 8.x, I recommend that you should take the chance to
> look
> > > at your whole infrastructure (from data ingestion through to query
> > > construction) and consider what needs upgrading/redesigning for both
> > > performance and future-proofing. You shouldn't just attempt a
> > > lift-and-shift of the current setup - some things just won't work and
> > > some may lock you into future issues. If you're running at large scale
> > > (I've talked to some people at the BL before and I know you have some
> > > huge indexes there!) then a redesign may be necessary for scalability
> > > reasons (cost and feasibility). You should also consider your skills
> > > base and how the team can stay up to date with Solr changes and modern
> > > search practice.
> > >
> > > Hope this helps - this is a common situation which I've seen many times
> > > before, you're certainly not the oldest version of Solr running I've
> > > seen recently either!
> > >
> > > best
> > >
> > > Charlie
> > >
> > > On 09/02/2021 01:14, Paul, Lulu wrote:
> > > > Hi SOLR team,
> > > >
> > > > Please may I ask for advice regarding upgrading the SOLR version (our
> > > project currently running on solr-5.2.1) to the latest version?
> > > > What are the steps, breaking changes and potential issues ? Could
> this
> > > be done as an incremental version upgrade or a direct jump to the
> newest
> > > version?
> > > >
> > > > Much appreciate the advice, Thank you!
> > > >
> > > > Best Wishes
> > > > Lulu
> > > >
> > > >
> > > >
> > >
> >
> > > --
> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > 
> > > Founding member of The Search Network 
> > > and co-author of Searching the Enterprise
> > > 
> > > tel/fax: +44 

Re: Frequent Index Replication Failure in solr.

2020-11-13 Thread David Hastings
Looks like your repeater is grabbing a file that the master has already
merged away into a different file. Why not lower how often you replicate
from master -> repeater, and/or don't commit so often, so index building
finishes faster?
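
For reference, the poll frequency is set on the slave/repeater side of the
replication handler in solrconfig.xml. A minimal sketch (the URL and
interval are illustrative, not taken from this thread):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- hypothetical master URL -->
        <str name="masterUrl">http://master-host:8983/solr/corename</str>
        <!-- poll less often (HH:mm:ss) so merges/commits can settle between fetches -->
        <str name="pollInterval">00:20:00</str>
      </lst>
    </requestHandler>

Polling less often, and committing less often on the master, reduces the
window in which the repeater asks for a file that a merge has already removed.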

On Fri, Nov 13, 2020 at 12:13 PM Parshant Kumar
 wrote:

> All,please help on this
>
> On Tue, Nov 3, 2020, 6:01 PM Parshant Kumar 
> wrote:
>
> > Hi team,
> >
> > We are having solr architecture as *master->repeater-> 3 slave servers.*
> >
> > We are doing incremental indexing on the master server(every 20 min) .
> > Replication of index is done from master to repeater server(every 10
> mins)
> > and from repeater to 3 slave servers (every 3 hours).
> > *We are facing the frequent replication failure between master to
> repeater
> > server  as well as between repeater  to slave servers.*
> > On checking logs found that every time one of the below  exceptions
> > occurred whenever the replication has failed .
> >
> > 1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
> > bytes)
> > java.io.EOFException: Unexpected end of ZLIB input stream
> > at
> > java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
> > at
> > java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> > at
> >
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
> > at
> >
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
> >
> >
> > 2)
> > WARN : Error getting file length for [segments_568]
> > java.nio.file.NoSuchFileException:
> >
> /data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
> > at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> > at
> >
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> > at
> >
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> > at
> >
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> > at java.nio.file.Files.readAttributes(Files.java:1737)
> > at java.nio.file.Files.size(Files.java:2332)
> > at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
> > at
> >
> org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
> > at
> >
> org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
> > at
> >
> org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)
> >
> > 3)
> > WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of
> > 555377795 bytes)
> > org.apache.http.MalformedChunkCodingException: CRLF expected at end of
> > chunk
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.read(ChunkedInputStream.java:186)
> > at
> >
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> > at
> > java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
> > at
> > java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> > at
> >
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1458)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1390)
> > at
> >
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:872)
> > at
> >
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:438)
> > at
> >
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
> >
> > *Replication configuration of master, repeater and slaves is given below:*
> >
> > 
> > 
> > 

Re: converting string to solr.TextField

2020-10-16 Thread David Hastings
Sorry, I was thinking that just using the *:* delete-query method for
clearing the index would still leave them there.
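
For reference, the *:* method here means posting a delete-by-query update
message to the core's /update handler, followed by a commit, e.g.:

    <delete><query>*:*</query></delete>
    <commit/>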

On Fri, Oct 16, 2020 at 4:28 PM Erick Erickson 
wrote:

> Not sure what you’re asking here. re-indexing, as I was
> using the term, means completely removing the index and
> starting over. Or indexing to a new collection. At any
> rate, starting from a state where there are _no_ segments.
>
> I’m guessing you’re still thinking that re-indexing without
> doing the above will work; it won’t. The way merging works,
> it chooses segments based on a number of things, including
> the percentage deleted documents. But there are still _other_
> live docs in the segment.
>
> Segment S1 has docs 1, 2, 3, 4 (old definition)
> Segment S2 has docs 5, 6, 7, 8 (new definition)
>
> Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
> discussion about not being able to do the right thing kicks in.
> Should S3 use the new or old definition? Whichever one
> it uses is wrong for the other segment. And remember,
> Lucene simply _cannot_ “do the right thing” if the data
> isn’t there.
>
> What you may be missing is that a segment is a “mini-index”.
> The underlying assumption is that all documents in that
> segment are produced with the same schema and can be
> accessed the same way. My comments about merging
> “doing the right thing” is really about transforming docs
> so all the docs can be treated the same. Which they can’t
> if they were produced with different schemas.
>
> Robert Muir’s statement is interesting here, built
> on Mike McCandless’ comment:
>
> "I think the key issue here is Lucene is an index not a database.
> Because it is a lossy index and does not retain all of the user’s
> data, its not possible to safely migrate some things automagically.
> …. The function is y = f(x) and if x is not available its not
> possible, so lucene can't do it."
>
> Don’t try to get around this. Prepare to
> re-index the entire corpus into a new collection whenever
> you change the schema and then maybe use an alias to
> seamlessly convert from the user’s perspective. If you
> simply cannot re-index from the system-of-record, you have
> two choices:
>
> 1> use new collections whenever you need to change the
>  schema and “somehow” have the app do different things
> with the new and old collections
>
> 2> set stored=true for all your source fields (i.e. not
>copyField destination). You can either roll your own
>program that pulls data from the old and sends
>it to the new or use the Collections API REINDEXCOLLECTION
>API call. But note that it’s specifically called out
>in the docs that all fields must be stored to use the
> API, what happens under the covers is that the
>  stored fields are read and sent to the target
>collection.
>
> In both these cases, Robert’s comment doesn’t apply. Well,
> it does apply but “if x is not available” is not the case,
> the original _is_ available; it’s the stored data...
>
> I’m over-stating the case somewhat, there are a few changes
> that you can get away with re-indexing all the docs into an
> existing index, things like changing from stored=true to
> stored=false, adding new fields, deleting fields (although the
> meta-data for the field is still kept around) etc.
>
> > On Oct 16, 2020, at 3:57 PM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > Gotcha, thanks for the explanation.  another small question if you
> > dont mind, when deleting docs they arent actually removed, just tagged as
> > deleted, and the old field/field type is still in the index until
> > merged/optimized as well, wouldnt that cause almost the same conflicts
> > until then?
> >
> > On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson 
> > wrote:
> >
> >> Doesn’t re-indexing a document just delete/replace….
> >>
> >> It’s complicated. For the individual document, yes. The problem
> >> comes because the field is inconsistent _between_ documents, and
> >> segment merging blows things up.
> >>
> >> Consider. I have segment1 with documents indexed with the old
> >> schema (String in this case). I  change my schema and index the same
> >> field as a text type.
> >>
> >> Eventually, a segment merge happens and these two segments get merged
> >> into a single new segment. How should the field be handled? Should it
> >> be defined as String or Text in the new segment? If you convert the docs
> >> with a Text definition for the field to String,
> >> you’d lose the ability to search for individual tokens. If you convert
> the
> >> Str

Re: converting string to solr.TextField

2020-10-16 Thread David Hastings
Gotcha, thanks for the explanation. Another small question, if you
don't mind: when deleting docs they aren't actually removed, just tagged as
deleted, and the old field/field type is still in the index until it is
merged/optimized as well. Wouldn't that cause almost the same conflicts
until then?

On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson 
wrote:

> Doesn’t re-indexing a document just delete/replace….
>
> It’s complicated. For the individual document, yes. The problem
> comes because the field is inconsistent _between_ documents, and
> segment merging blows things up.
>
> Consider. I have segment1 with documents indexed with the old
> schema (String in this case). I  change my schema and index the same
> field as a text type.
>
> Eventually, a segment merge happens and these two segments get merged
> into a single new segment. How should the field be handled? Should it
> be defined as String or Text in the new segment? If you convert the docs
> with a Text definition for the field to String,
> you’d lose the ability to search for individual tokens. If you convert the
> String to Text, you don’t have any guarantee that the information is even
> available.
>
> This is just the tip of the iceberg in terms of trying to change the
> definition of a field. Take the case of changing the analysis chain,
> say you use a phonetic filter on a field then decide to remove it and
> do not store the original. Erick might be encoded as “ENXY” so the
> original data is simply not there to convert. Ditto removing a
> stemmer, lowercasing, applying a regex, …...
>
>
> From Mike McCandless:
>
> "This really is the difference between an index and a database:
>  we do not store, precisely, the original documents.  We store
> an efficient derived/computed index from them.  Yes, Solr/ES
> can add database-like behavior where they hold the true original
> source of the document and use that to rebuild Lucene indices
> over time.  But Lucene really is just a "search index" and we
> need to be free to make important improvements with time."
>
> And all that aside, you have to re-index all the docs anyway or
> your search results will be inconsistent. So leaving aside the
> impossible task of covering all the possibilities on the fly, it’s
> better to plan on re-indexing….
>
> Best,
> Erick
>
>
> > On Oct 16, 2020, at 3:16 PM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > "If you want to
> > keep the same field name, you need to delete all of the
> > documents in the index, change the schema, and reindex."
> >
> > actually doesnt re-indexing a document just delete/replace anyways
> assuming
> > the same id?
> >
> > On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Just as a side note,
> >>
> >>> indexed="true"
> >> If you are storing 32K message, you probably are not searching it as a
> >> whole string. So, don't index it. You may also want to mark the field
> >> as 'large' (and lazy):
> >>
> >>
> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
> >>
> >> When you are going to make it a text field, you will probably be
> >> having the same issues as well.
> >>
> >> And honestly, if you are not storing those fields to search, maybe you
> >> need to consider the architecture. Maybe those fields do not need to
> >> be in Solr at all, but in external systems. Solr (or any search
> >> system) should not be your system of records since - as the other
> >> reply showed - some of the answers are "reindex everything".
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar 
> >> wrote:
> >>>
> >>> I am using solr 8.2
> >>>
> >>> Can I change the schema fieldtype from string to solr.TextField
> >>> without indexing?
> >>>
> >>> >> stored="true"/>
> >>>
> >>> The reason is that string has only 32K char limit where as I am looking
> >> to
> >>> store more than 32K now.
> >>>
> >>> The contents on this field doesn't require any analysis or tokenized
> but
> >> I
> >>> need this field in the queries and as well as output fields.
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Yaswanth Kumar Konathala.
> >>> yaswanth...@gmail.com
> >>
>
>


Re: converting string to solr.TextField

2020-10-16 Thread David Hastings
"If you want to
keep the same field name, you need to delete all of the
documents in the index, change the schema, and reindex."

Actually, doesn't re-indexing a document just delete/replace anyway, assuming
the same id?

On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch 
wrote:

> Just as a side note,
>
> > indexed="true"
> If you are storing 32K message, you probably are not searching it as a
> whole string. So, don't index it. You may also want to mark the field
> as 'large' (and lazy):
>
> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
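
As a concrete sketch of that suggestion (the field and type names are
hypothetical, and the exact constraints on 'large' are in the linked page):

    <field name="message_txt" type="text_general"
           indexed="false" stored="true" large="true"/>

The 32K limit being hit is on indexed terms, so a stored-only field
sidesteps it while the value remains retrievable in results.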
>
> When you are going to make it a text field, you will probably be
> having the same issues as well.
>
> And honestly, if you are not storing those fields to search, maybe you
> need to consider the architecture. Maybe those fields do not need to
> be in Solr at all, but in external systems. Solr (or any search
> system) should not be your system of records since - as the other
> reply showed - some of the answers are "reindex everything".
>
> Regards,
>Alex.
>
> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar 
> wrote:
> >
> > I am using solr 8.2
> >
> > Can I change the schema fieldtype from string to solr.TextField
> > without indexing?
> >
> >  stored="true"/>
> >
> > The reason is that string has only 32K char limit where as I am looking
> to
> > store more than 32K now.
> >
> > The contents on this field doesn't require any analysis or tokenized but
> I
> > need this field in the queries and as well as output fields.
> >
> > --
> > Thanks & Regards,
> > Yaswanth Kumar Konathala.
> > yaswanth...@gmail.com
>


Re: Solr endpoint on the public internet

2020-10-08 Thread David Hastings
Welp. Never mind, I refer back to point #1: this is a bad idea.

> On Oct 8, 2020, at 3:01 PM, Alexandre Rafalovitch  wrote:
> 
> The update handlers are now implicitly defined (3 or 4 of them). So,
> it actually needs to be explicitly shadowed and overridden with other
> Noop handler. And block Config API to avoid attackers creating new
> handlers.
> 
> Regards,
>   Alex.
> 
>> On Thu, 8 Oct 2020 at 14:54, David Hastings  wrote:
>> 
>> Well that’s why I suggested deleting the update handler :)
>> 
>>>> On Oct 8, 2020, at 2:52 PM, Walter Underwood  wrote:
>>> 
>>> Let me know where it is and I’ll delete all the documents in your 
>>> collection.
>>> It is easy, just one HTTP request.
>>> 
>>> https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Oct 8, 2020, at 11:49 AM, Alexandre Rafalovitch  
>>>> wrote:
>>>> 
>>>> I think there were past discussions about people doing but they really
>>>> really knew what they were doing from a security perspective, not just
>>>> Solr one.
>>>> 
>>>> You are increasing your risk factor a lot, so you need to think
>>>> through this. What are you protecting and what are you exposing. Are
>>>> you trying to protect the updates? You may be able to do it with - for
>>>> example - read-only docker container, or with embedded Solr or/and
>>>> with reverse proxy.
>>>> 
>>>> Are you trying to protect some of the data from being read? Even harder.
>>>> 
>>>> There are implicit handlers, admin handlers, 'qt' to select query
>>>> parser, etc. Lots of things to think about.
>>>> 
>>>> It just may not be worth it.
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> 
>>>>> On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  
>>>>> wrote:
>>>>> 
>>>>> Hi!
>>>>> 
>>>>> We're looking into the option of setting up search with Solr without an
>>>>> intermediary application. This would mean our backend would index data 
>>>>> into
>>>>> Solr and we would have a public Solr endpoint on the internet that would
>>>>> receive search requests directly.
>>>>> 
>>>>> Since I couldn't find an existing solution similar to ours, I would like 
>>>>> to
>>>>> know whether it's possible to secure Solr in a way that allows anyone only
>>>>> read-access only to collections and how to achieve that. Specifically
>>>>> because of this part of the documentation
>>>>> <https://lucene.apache.org/solr/guide/8_5/securing-solr.html>:
>>>>> 
>>>>> *No Solr API, including the Admin UI, is designed to be exposed to
>>>>> non-trusted parties. Tune your firewall so that only trusted computers and
>>>>> people are allowed access. Because of this, the project will not regard
>>>>> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
>>>>> ask you to report such issues in JIRA.*
>>>>> Is there a way we can restrict read-only access to Solr collections so as
>>>>> to allow users to make search requests directly to it or should we always
>>>>> keep our Solr instances completely private?
>>>>> 
>>>>> Thanks in advance!
>>>>> 
>>>>> Best regards,
>>>>> Marco Godinho
>>> 


Re: Solr endpoint on the public internet

2020-10-08 Thread David Hastings
Well that’s why I suggested deleting the update handler :)

> On Oct 8, 2020, at 2:52 PM, Walter Underwood  wrote:
> 
> Let me know where it is and I’ll delete all the documents in your collection.
> It is easy, just one HTTP request.
> 
> https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Oct 8, 2020, at 11:49 AM, Alexandre Rafalovitch  
>> wrote:
>> 
>> I think there were past discussions about people doing but they really
>> really knew what they were doing from a security perspective, not just
>> Solr one.
>> 
>> You are increasing your risk factor a lot, so you need to think
>> through this. What are you protecting and what are you exposing. Are
>> you trying to protect the updates? You may be able to do it with - for
>> example - read-only docker container, or with embedded Solr or/and
>> with reverse proxy.
>> 
>> Are you trying to protect some of the data from being read? Even harder.
>> 
>> There are implicit handlers, admin handlers, 'qt' to select query
>> parser, etc. Lots of things to think about.
>> 
>> It just may not be worth it.
>> 
>> Regards,
>>  Alex.
>> 
>> 
>>> On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  
>>> wrote:
>>> 
>>> Hi!
>>> 
>>> We're looking into the option of setting up search with Solr without an
>>> intermediary application. This would mean our backend would index data into
>>> Solr and we would have a public Solr endpoint on the internet that would
>>> receive search requests directly.
>>> 
>>> Since I couldn't find an existing solution similar to ours, I would like to
>>> know whether it's possible to secure Solr in a way that allows anyone only
>>> read-access only to collections and how to achieve that. Specifically
>>> because of this part of the documentation
>>> :
>>> 
>>> *No Solr API, including the Admin UI, is designed to be exposed to
>>> non-trusted parties. Tune your firewall so that only trusted computers and
>>> people are allowed access. Because of this, the project will not regard
>>> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
>>> ask you to report such issues in JIRA.*
>>> Is there a way we can restrict read-only access to Solr collections so as
>>> to allow users to make search requests directly to it or should we always
>>> keep our Solr instances completely private?
>>> 
>>> Thanks in advance!
>>> 
>>> Best regards,
>>> Marco Godinho
> 


Re: Master/Slave

2020-09-30 Thread David Hastings
>whether we should expect Master/Slave replication also to be deprecated

It had better not ever be deprecated. It has been the most reliable mechanism
for its purpose, and SolrCloud isn't going to replace standalone. If it does,
that's when I guess I stop upgrading or move to Elastic.

On Wed, Sep 30, 2020 at 2:58 PM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:

> Based on the thread below (reading "legacy" as meaning "likely to be
> deprecated in later versions"), we have been working to extract ourselves
> from Master/Slave replication
>
> Most of our collections need to be in two data centers (a read/write copy
> in one local data center: the disaster-recovery-site SolrCloud could be
> read-only). We also need redundancy within each data center for when one
> host or another is unavailable. We implemented this by having different
> SolrClouds in the different data centers; with Master/Slave replication
> pulling data from one of the read/write replicas to each of the Slave
> replicas in the disaster-recovery-site read-only SolrCloud. Additionally,
> for some collections, there is a desire to have local read-only replicas
> remain unchanged for querying during the loading process: for these
> collections, there is a local read/write loading SolrCloud, a local
> read-only querying SolrCloud (normally configured for Master/Slave
> replication from one of the replicas of the loader SolrCloud to both
> replicas of the query SolrCloud, but with Master/Slave disabled when the
> load was in progress on the loader SolrCloud, and with Master/Slave resumed
> after the loaded data passes QA checks).
>
> Based on the thread below, we made an attempt to switch to CDCR. The main
> reason for wanting to change was that CDCR was said to be the supported
> mechanism, and the replacement for Master/Slave replication.
>
> After multiple unsuccessful attempts to get CDCR to work, we ended up with
> reproducible cases of CDCR losing data in transit. In June, I initiated a
> thread in this group asking for clarification of how/whether CDCR could be
> made reliable. This seemed to me to be met with deafening silence until the
> announcement in July of the release of Solr8.6 and the deprecation of CDCR.
>
> So we are left with the question whether we should expect Master/Slave
> replication also to be deprecated; and if so, with what is it expected to
> be replaced (since not with CDCR)? Or is it now sufficiently safe to assume
> that Master/Slave replication will continue to be supported after all
> (since the assertion that it would be replaced by CDCR has been
> discredited)? In either case, are there other suggested implementations of
> having a read-only SolrCloud receive data from a read/write SolrCloud?
>
>
> Thanks
>
> -Original Message-
> From: Shawn Heisey 
> Sent: Tuesday, May 21, 2019 11:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud (7.3) and Legacy replication slaves
>
> On 5/21/2019 8:48 AM, Michael Tracey wrote:
> > Is it possible set up an existing SolrCloud cluster as the master for
> > legacy replication to a slave server or two?   It looks like another
> option
> > is to use Uni-direction CDCR, but not sure what is the best option in
> this
> > case.
>
> You're asking for problems if you try to combine legacy replication with
> SolrCloud.  The two features are not guaranteed to work together.
>
> CDCR is your best bet.  This replicates from one SolrCloud cluster to
> another.
>
> Thanks,
> Shawn
>


Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention is to make sure the indexer you build doesn't send
commits until it's actually done. Made that mistake with some early
in-house indexers.
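
For reference, one way to do that is to send no commits at all from the
indexing client and let the server handle visibility via autoCommit and
autoSoftCommit in solrconfig.xml. A sketch with illustrative values:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard commit for durability; does not open a new searcher -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- soft commit controls when newly indexed documents become searchable -->
      <autoSoftCommit>
        <maxTime>300000</maxTime>
      </autoSoftCommit>
    </updateHandler>

With this in place the client just streams documents and issues at most one
explicit commit at the very end of the run.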

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Number of times in document

2020-08-12 Thread David Hastings
Is there any way to do a query for the minimum number of times a phrase or
string exists in a document?  This has been a request from some users as
other search services (names not to be mentioned) have such a
functionality.  Ive been using solr since 1.4 and i think ive tried finding
this ability before but pretty sure its completely against the standard
ranking functionality, but figured I would send out a feeler if this is
something that can be done


Re: Multiple "df" fields

2020-08-11 Thread David Hastings
why not use a copyfield for indexing?
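
A minimal sketch of that idea, with hypothetical field and type names: both
source fields are copied into one catch-all field, and that field becomes
the df.

    <!-- schema: catch-all destination for both source fields -->
    <field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="field1" dest="text_all"/>
    <copyField source="field2" dest="text_all"/>

    <!-- solrconfig.xml: use it as the default search field -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="df">text_all</str>
      </lst>
    </requestHandler>

Edd's caveat still applies: the copy destination gets a single analyzer, so
if the two sources really need different analysis, querying both fields
explicitly (e.g. edismax with qf=field1 field2) is the usual alternative.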

On Tue, Aug 11, 2020 at 9:59 AM Edward Turner  wrote:

> Hi all,
>
> Is it possible to have multiple "df" fields? (We think the answer is no
> because our experiments did not work when adding multiple "df" values to
> solrconfig.xml -- but we just wanted to double check with those who know
> better.) The reason we would like to do this is that we have two main field
> types (with different analyzers) and we'd like queries without a field to
> be searched over both of them. We could also use copyfields, but this would
> require us to have a common analyzer, which isn't exactly what we want.
>
> An alternative solution is to pre-process the query prior to sending it to
> Solr, so that queries with no field are changed as follows:
>
> q=value -> q=(field1:value OR field2:value)
>
> ... however, we feel a bit uncomfortable doing this though via String
> manipulation.
>
> Is there an obvious way we should tackle this problem that we are missing
> (e.g., which would be cleaner/safer and perhaps works at the Query object
> level)?
>
> Many thanks and best wishes,
>
> Edd
>


Re: solr query returns items with spaces removed

2020-07-29 Thread David Hastings
"Oh, and returning 100K docs is an anti-pattern, if you really need that
many docs consider cursorMark and/or Streaming."

Er, I routinely ask for 2+ million records into a single file based on a
query. I mean, not into a web application or anything; it's meant to be
processed after the fact. But Solr has no issue doing this.



On Wed, Jul 29, 2020 at 4:53 PM Erick Erickson 
wrote:

> I don’t think there’s really a canned way to do what you’re asking. A
> custom DocTransformer would probably do the trick though.
>
> You could also create a custom QueryComponent that examined the docs being
> returned and inserted a blank field for a selected number of fields
> (possibly configurable in solrconfig.xml).
>
> Oh, and returning 100K docs is an anti-pattern, if you really need that
> many docs consider cursorMark and/or Streaming.
>
> Best,
> Erick
>
> > On Jul 29, 2020, at 2:55 PM, Teresa McMains 
> wrote:
> >
> > Thanks so much.  Is there any other way to return the data value if it
> exists, otherwise an empty string?  I'm integrating this with a 3rd party
> app which I can't change. When the field is null it isn't showing up in the
> output.
> >
> > -Original Message-
> > From: Erick Erickson 
> > Sent: Wednesday, July 29, 2020 12:49 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: solr query returns items with spaces removed
> >
> > The “def” function goes after the _indexed_ value, so that’s what you’re
> getting back. Try just specifying “fl=INSTRUCTIONS”, and if the value is
> stored that should return the original field value before any analysis is
> done.
> >
> > Why are you using the def function? If the field is absent from the doc,
> nothing will be returned for that field, not even the name. Are you trying
> to insure that a blank field is returned if the field isn’t in the
> document? You can handle that on the client side if so…
> >
> > Best,
> > Erick
> >
> >> On Jul 29, 2020, at 10:34 AM, Teresa McMains 
> wrote:
> >>
> >> _20_Instructions_And_Notes:def(INSTRUCTIONS,%22%22)
> >
>
>


Re: Meow attacks

2020-07-28 Thread David Hastings
So, your ZooKeeper/Solr servers have public-facing addresses/ports?



On Tue, Jul 28, 2020 at 4:41 PM Odysci  wrote:

> Folks,
>
> I suspect one of our Zookeeper installations on AWS was subject to a Meow
> attack (
>
> https://arstechnica.com/information-technology/2020/07/more-than-1000-databases-have-been-nuked-by-mystery-meow-attack/
> )
>
> Basically, the configuration for one of our collections disappeared from
> the Zookeeper tree (when looking at the Solr interface), and it left
> several files ending in "-meow"
> Before I realized it, I stopped and restarted the ZK and Solr machines (as
> part of ubuntu updates), and when ZK didn't find the configuration for a
> collection, it deleted the collection from Solr. At least that's what I
> suspect happened.
>
> Fortunately it affected a very small index and we had backups. But it is
> very worrisome.
> Has anyone had any problems with this?
> Is there any type of log that I can check to sort out how this happened?
> The ZK log complained that the configs for the collection were not there,
> but that's about it.
>
> and, is there a better way to protect against such attacks?
> Thanks
>
> Reinaldo
>


Re: sorting help

2020-07-15 Thread David Hastings
Thanks, I'll check the admin UI. I didn't want to send a big block of text,
but here is the analysis chain for the field:


Index Analyzer (org.apache.solr.analysis.TokenizerChain) for field alphatitle:

  Tokenizer:
    org.apache.lucene.analysis.core.KeywordTokenizerFactory
      (class: solr.KeywordTokenizerFactory, luceneMatchVersion: 7.1.0)

  Token Filters:
    org.apache.lucene.analysis.core.LowerCaseFilterFactory
      (class: solr.LowerCaseFilterFactory, luceneMatchVersion: 7.1.0)
    org.apache.lucene.analysis.miscellaneous.TrimFilterFactory
      (class: solr.TrimFilterFactory, luceneMatchVersion: 7.1.0)
    org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory
      (class: solr.PatternReplaceFilterFactory, pattern: ([^a-z]),
       replacement: "", replace: all, luceneMatchVersion: 7.1.0)

Query Analyzer (org.apache.solr.analysis.TokenizerChain): identical chain
  (KeywordTokenizerFactory, LowerCaseFilterFactory, TrimFilterFactory,
   PatternReplaceFilterFactory with the same settings)
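
As schema XML, a chain like that would look roughly like the following. The
copyField and the filters are taken from the analysis dump above; the type
name and field attributes are illustrative guesses:

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <!-- strips every character that is not a lowercase a-z letter, including digits and spaces -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
      </analyzer>
    </fieldType>

    <field name="alphatitle" type="alphaOnlySort" indexed="true" stored="false"/>
    <copyField source="title" dest="alphatitle"/>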


On Wed, Jul 15, 2020 at 4:47 PM Erick Erickson 
wrote:

> I’d look two places:
>
> 1> try the admin/analysis page from the admin UI. In particular, look at
> what tokens actually get in the index.
>
> 2> again, the admin UI will let you choose the field (alphatitle and
> title) and see what the actual indexed tokens are.
>
> Both have the issue that I don’t know what tokenizer you are using. For
> sorting it better be something
> like KeywordTokenizer. Anything that breaks up the input into separate
> tokens will produce surprises.
>
> And unless you have lowercaseFilter in front of your patternreplace,
> you’re removing uppercase characters.
>
> Best,
> Erick
>
> > On Jul 15, 2020, at 3:06 PM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > howdy,
> > i have a field that sorts fine all other content, and i cant seem to
> debug
> > why it wont sort for me on this one chunk of it.
> > "sort":"alphatitle asc", "debugQuery":"on", "_":"1594733127740"}},
> "response
> > ":{"numFound":3,"start":0,"docs":[ { "title":"Money orders", {
> > "title":"Finance,
> > consolidation and rescheduling of debts", { "title":"Rights in former
> > German Islands in Pacific", },
> >
> > its using a copyfield from "title" to "alphatitle" that replaces all
> > punctuation
> > pattern: ([^a-z])replace: allclass: solr.PatternReplaceFilterFactory
> >
> > and if i use just title it flips:
> >
> > "title":"Finance, consolidation and rescheduling of debts"}, {
> "title":"Rights
> > in former German Islands in Pacific"}, { "title":"Money orders"}]
> >
> > and im banging my head trying to figure out what it is about this
> > content in particular that is not sorting the way I would expect.
> > don't suppose someone would be able to lead me to a good place to look?
>
>


sorting help

2020-07-15 Thread David Hastings
Howdy,
I have a field that sorts fine for all other content, and I can't seem to
debug why it won't sort for me on this one chunk of it.

With "sort":"alphatitle asc" (debugQuery on), the response is:

    "response":{"numFound":3,"start":0,"docs":[
      { "title":"Money orders" },
      { "title":"Finance, consolidation and rescheduling of debts" },
      { "title":"Rights in former German Islands in Pacific" } ]

It's using a copyField from "title" to "alphatitle" that replaces all
punctuation (solr.PatternReplaceFilterFactory, pattern: ([^a-z]),
replace: all).

And if I sort on just "title" it flips:

      { "title":"Finance, consolidation and rescheduling of debts" },
      { "title":"Rights in former German Islands in Pacific" },
      { "title":"Money orders" } ]

And I'm banging my head trying to figure out what it is about this
content in particular that is not sorting the way I would expect.
Don't suppose someone would be able to lead me to a good place to look?


Re: How to determine why solr stops running?

2020-06-29 Thread David Hastings
Little nit-picky note here: use 31 GB, never 32.

On Mon, Jun 29, 2020 at 1:45 PM Ryan W  wrote:

> It figures it would happen again a couple hours after I suggested the issue
> might be resolved.  Just now, Solr stopped running.  I cleared the cache in
> my app a couple times around the time that it happened, so perhaps that was
> somehow too taxing for the server.  However, I've never allocated so much
> RAM to a website before, so it's odd that I'm getting these failures.  My
> colleagues were astonished when I said people on the solr-user list were
> telling me I might need 32GB just for solr.
>
> I manage another project that uses Drupal + Solr, and we have a total of
> 8GB of RAM on that server and Solr never, ever stops.  I've been managing
> that site for years and never seen a Solr outage.  On that project,
> Drupal + Solr is OK with 8GB, but somehow this other project needs 64 GB or
> more?
>
> "The thing that’s unsettling about this is that assuming you were hitting
> OOMs, and were running the OOM-killer script, you _should_ have had very
> clear evidence that that was the cause."
>
> How do I know if I'm running the OOM-killer script?
>
> Thank you.
>
> On Mon, Jun 29, 2020 at 12:12 PM Erick Erickson 
> wrote:
>
> > The thing that’s unsettling about this is that assuming you were hitting
> > OOMs,
> > and were running the OOM-killer script, you _should_ have had very clear
> > evidence that that was the cause.
> >
> > If you were not running the killer script, the apologies for not asking
> > about that
> > in the first place. Java’s performance is unpredictable when OOMs happen,
> > which is the point of the killer script: at least Solr stops rather than
> do
> > something inexplicable.
> >
> > Best,
> > Erick
> >
> > > On Jun 29, 2020, at 11:52 AM, David Hastings <
> > hastings.recurs...@gmail.com> wrote:
> > >
> > > sometimes just throwing money/ram/ssd at the problem is just the best
> > > answer.
> > >
> > > On Mon, Jun 29, 2020 at 11:38 AM Ryan W  wrote:
> > >
> > >> Thanks everyone. Just to give an update on this issue, I bumped the
> RAM
> > >> available to Solr up to 16GB a couple weeks ago, and haven’t had any
> > >> problem since.
> > >>
> > >>
> > >> On Tue, Jun 16, 2020 at 1:00 PM David Hastings <
> > >> hastings.recurs...@gmail.com>
> > >> wrote:
> > >>
> > >>> me personally, around 290gb.  as much as we could shove into them
> > >>>
> > >>> On Tue, Jun 16, 2020 at 12:44 PM Erick Erickson <
> > erickerick...@gmail.com
> > >>>
> > >>> wrote:
> > >>>
> > >>>> How much physical RAM? A rule of thumb is that you should allocate
> no
> > >>> more
> > >>>> than 25-50 percent of the total physical RAM to Solr. That's
> > >> cumulative,
> > >>>> i.e. the sum of the heap allocations across all your JVMs should be
> > >> below
> > >>>> that percentage. See Uwe Schindler's mmapdirectiry blog...
> > >>>>
> > >>>> Shot in the dark...
> > >>>>
> > >>>> On Tue, Jun 16, 2020, 11:51 David Hastings <
> > >> hastings.recurs...@gmail.com
> > >>>>
> > >>>> wrote:
> > >>>>
> > >>>>> To add to this, i generally have solr start with this:
> > >>>>> -Xms31000m-Xmx31000m
> > >>>>>
> > >>>>> and the only other thing that runs on them are maria db gallera
> > >> cluster
> > >>>>> nodes that are not in use (aside from replication)
> > >>>>>
> > >>>>> the 31gb is not an accident either, you dont want 32gb.
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Jun 16, 2020 at 11:26 AM Shawn Heisey  >
> > >>>> wrote:
> > >>>>>
> > >>>>>> On 6/11/2020 11:52 AM, Ryan W wrote:
> > >>>>>>>> I will check "dmesg" first, to find out any hardware error
> > >>> message.
> > >>>>>>
> > >>>>>> 
> > >>>>>>
> > >>>>>>> [1521232.781801] Out of memory: Kill process 117529 (httpd)
> > >> score 9
> > >>>> or
> > >>>>>>> sacrifice child
> &

Re: How to determine why solr stops running?

2020-06-29 Thread David Hastings
Sometimes just throwing money/RAM/SSD at the problem is the best answer.

On Mon, Jun 29, 2020 at 11:38 AM Ryan W  wrote:

> Thanks everyone. Just to give an update on this issue, I bumped the RAM
> available to Solr up to 16GB a couple weeks ago, and haven’t had any
> problem since.
>
>
> On Tue, Jun 16, 2020 at 1:00 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > me personally, around 290gb.  as much as we could shove into them
> >
> > On Tue, Jun 16, 2020 at 12:44 PM Erick Erickson  >
> > wrote:
> >
> > > How much physical RAM? A rule of thumb is that you should allocate no
> > more
> > > than 25-50 percent of the total physical RAM to Solr. That's
> cumulative,
> > > i.e. the sum of the heap allocations across all your JVMs should be
> below
> > > that percentage. See Uwe Schindler's mmapdirectiry blog...
> > >
> > > Shot in the dark...
> > >
> > > On Tue, Jun 16, 2020, 11:51 David Hastings <
> hastings.recurs...@gmail.com
> > >
> > > wrote:
> > >
> > > > To add to this, i generally have solr start with this:
> > > > -Xms31000m-Xmx31000m
> > > >
> > > > and the only other thing that runs on them are maria db gallera
> cluster
> > > > nodes that are not in use (aside from replication)
> > > >
> > > > the 31gb is not an accident either, you dont want 32gb.
> > > >
> > > >
> > > > On Tue, Jun 16, 2020 at 11:26 AM Shawn Heisey 
> > > wrote:
> > > >
> > > > > On 6/11/2020 11:52 AM, Ryan W wrote:
> > > > > >> I will check "dmesg" first, to find out any hardware error
> > message.
> > > > >
> > > > > 
> > > > >
> > > > > > [1521232.781801] Out of memory: Kill process 117529 (httpd)
> score 9
> > > or
> > > > > > sacrifice child
> > > > > > [1521232.782908] Killed process 117529 (httpd), UID 48,
> > > > > total-vm:675824kB,
> > > > > > anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB
> > > > > >
> > > > > > Is this a relevant "Out of memory" message?  Does this suggest an
> > OOM
> > > > > > situation is the culprit?
> > > > >
> > > > > Because this was in the "dmesg" output, it indicates that it is the
> > > > > operating system killing programs because the *system* doesn't have
> > any
> > > > > memory left.  It wasn't Java that did this, and it wasn't Solr that
> > was
> > > > > killed.  It very well could have been Solr that was killed at
> another
> > > > > time, though.
> > > > >
> > > > > The process that it killed this time is named httpd ... which is
> most
> > > > > likely the Apache webserver.  Because the UID is 48, this is
> probably
> > > an
> > > > > OS derived from Redhat, where the "apache" user has UID and GID 48
> by
> > > > > default.  Apache with its default config can be VERY memory hungry
> > when
> > > > > it gets busy.
> > > > >
> > > > > > -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
> > > > >
> > > > > This says that you started Solr with the default 512MB heap.  Which
> > is
> > > > > VERY VERY small.  The default is small so that Solr will start on
> > > > > virtually any hardware.  Almost every user must increase the heap
> > size.
> > > > > And because the OS is killing processes, it is likely that the
> system
> > > > > does not have enough memory installed for what you have running on
> > it.
> > > > >
> > > > > It is generally not a good idea to share the server hardware
> between
> > > > > Solr and other software, unless the system has a lot of spare
> > > resources,
> > > > > memory in particular.
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > > >
> > >
> >
>


Re: Solr 7.6 optimize index size increase

2020-06-16 Thread David Hastings
I can't give you a 100% certain answer, but I've experienced this. What
"seemed" to happen to me was that the optimize would start, which can drive
the size up threefold, and if you run out of disk space in the process the
optimize will quit (since it can't finish) and leave the live index pieces
intact. So now you have the "current" index as well as the "optimized"
fragments.

I can't say for certain that's what you ran into, but we found that if you
use an expanding disk it will keep growing and prevent this from happening;
then the index will contract and the disk will shrink back to only what it
needs. That saved me a lot of headaches, not needing to ever worry about
disk space.

On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
 wrote:

>
> when optimize command is issued, the expectation after the completion of
> optimization process is that the index size either decreases or at most
> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
> is issued, some of the shard's transient or older segment files are not
> deleted. This is happening randomly across all shards. When unnoticed these
> transient files makes disk full. Currently it is handled through monitors,
> but question is what is causing the transient/older files remains there.
> Are there any specific race conditions which laves the older files not
> being deleted?
> Any pointers around this will be helpful.
>  TIA


Re: How to determine why solr stops running?

2020-06-16 Thread David Hastings
Me personally, around 290 GB; as much as we could shove into them.

On Tue, Jun 16, 2020 at 12:44 PM Erick Erickson 
wrote:

> How much physical RAM? A rule of thumb is that you should allocate no more
> than 25-50 percent of the total physical RAM to Solr. That's cumulative,
> i.e. the sum of the heap allocations across all your JVMs should be below
> that percentage. See Uwe Schindler's mmapdirectiry blog...
>
> Shot in the dark...
>
> On Tue, Jun 16, 2020, 11:51 David Hastings 
> wrote:
>
> > To add to this, i generally have solr start with this:
> > -Xms31000m-Xmx31000m
> >
> > and the only other thing that runs on them are maria db gallera cluster
> > nodes that are not in use (aside from replication)
> >
> > the 31gb is not an accident either, you dont want 32gb.
> >
> >
> > On Tue, Jun 16, 2020 at 11:26 AM Shawn Heisey 
> wrote:
> >
> > > On 6/11/2020 11:52 AM, Ryan W wrote:
> > > >> I will check "dmesg" first, to find out any hardware error message.
> > >
> > > 
> > >
> > > > [1521232.781801] Out of memory: Kill process 117529 (httpd) score 9
> or
> > > > sacrifice child
> > > > [1521232.782908] Killed process 117529 (httpd), UID 48,
> > > total-vm:675824kB,
> > > > anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB
> > > >
> > > > Is this a relevant "Out of memory" message?  Does this suggest an OOM
> > > > situation is the culprit?
> > >
> > > Because this was in the "dmesg" output, it indicates that it is the
> > > operating system killing programs because the *system* doesn't have any
> > > memory left.  It wasn't Java that did this, and it wasn't Solr that was
> > > killed.  It very well could have been Solr that was killed at another
> > > time, though.
> > >
> > > The process that it killed this time is named httpd ... which is most
> > > likely the Apache webserver.  Because the UID is 48, this is probably
> an
> > > OS derived from Redhat, where the "apache" user has UID and GID 48 by
> > > default.  Apache with its default config can be VERY memory hungry when
> > > it gets busy.
> > >
> > > > -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
> > >
> > > This says that you started Solr with the default 512MB heap.  Which is
> > > VERY VERY small.  The default is small so that Solr will start on
> > > virtually any hardware.  Almost every user must increase the heap size.
> > > And because the OS is killing processes, it is likely that the system
> > > does not have enough memory installed for what you have running on it.
> > >
> > > It is generally not a good idea to share the server hardware between
> > > Solr and other software, unless the system has a lot of spare
> resources,
> > > memory in particular.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
>


Re: How to determine why solr stops running?

2020-06-16 Thread David Hastings
To add to this, I generally have Solr start with this:
-Xms31000m -Xmx31000m

and the only other things that run on them are MariaDB Galera cluster
nodes that are not in use (aside from replication).

The 31 GB is not an accident either; you don't want 32 GB.


On Tue, Jun 16, 2020 at 11:26 AM Shawn Heisey  wrote:

> On 6/11/2020 11:52 AM, Ryan W wrote:
> >> I will check "dmesg" first, to find out any hardware error message.
>
> 
>
> > [1521232.781801] Out of memory: Kill process 117529 (httpd) score 9 or
> > sacrifice child
> > [1521232.782908] Killed process 117529 (httpd), UID 48,
> total-vm:675824kB,
> > anon-rss:181844kB, file-rss:0kB, shmem-rss:0kB
> >
> > Is this a relevant "Out of memory" message?  Does this suggest an OOM
> > situation is the culprit?
>
> Because this was in the "dmesg" output, it indicates that it is the
> operating system killing programs because the *system* doesn't have any
> memory left.  It wasn't Java that did this, and it wasn't Solr that was
> killed.  It very well could have been Solr that was killed at another
> time, though.
>
> The process that it killed this time is named httpd ... which is most
> likely the Apache webserver.  Because the UID is 48, this is probably an
> OS derived from Redhat, where the "apache" user has UID and GID 48 by
> default.  Apache with its default config can be VERY memory hungry when
> it gets busy.
>
> > -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
>
> This says that you started Solr with the default 512MB heap.  Which is
> VERY VERY small.  The default is small so that Solr will start on
> virtually any hardware.  Almost every user must increase the heap size.
> And because the OS is killing processes, it is likely that the system
> does not have enough memory installed for what you have running on it.
>
> It is generally not a good idea to share the server hardware between
> Solr and other software, unless the system has a lot of spare resources,
> memory in particular.
>
> Thanks,
> Shawn
>


Re: Getting rid of zookeeper

2020-06-09 Thread David Hastings
ZooKeeper is annoying to both set up and manage, but then again the same
thing can be said about SolrCloud. Not certain why you would want to deal
with either.

On Tue, Jun 9, 2020 at 3:29 PM S G  wrote:

> Hello,
>
> I recently stumbled across KIP-500: Replace ZooKeeper with a Self-Managed
> Metadata Quorum
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
> >
> Elastic-search does this too.
> And so do many other systems.
>
> Is there some work to go in this direction?
> It would be nice to get rid of another totally disparate system.
> Hardware savings would be nice to have too.
>
> Best,
> SG
>


Re: Script to check if solr is running

2020-06-08 Thread David Hastings
>
> Why have a cold backup and then switch?
>

My current setup is:
1. master indexer
2. master slave, updated on a release/commit basis
3. 3 live slave searching nodes in two different data centers

The three live nodes are behind nginx load balancing and they are mostly
hot, but not all of them. I found that having all the load go to one node
made performance significantly better, but if one of them goes down there's
a likelihood that the other two went with it; they are also part of a
MySQL Galera cluster and it has a possibility of going down (InnoDB can be
annoying). So the script will go through all three of the live slaves until
it has to fall back to the master slave. I know the cold master will work,
mostly out of faith, but if I lose four servers all at the same time, I
have larger problems to worry about than searching.

It's just adaptation over time. I can't say it's the best setup, but I can
say it operates pretty well, very well speed-wise, keeping one searcher
super hot with two clones ready to jump in if needed.



On Mon, Jun 8, 2020 at 12:30 PM Walter Underwood 
wrote:

> I could write a script, too, though I’d do it with straight shell code.
> But then I’d have to test it, check it in somewhere, document it for ops,
> install it, ...
>
> Instead, when we switch from monit, I'll start with one of these systemd
> configs.
>
> https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1 <
> https://gist.github.com/hammady/3d7b5964c7b0f90997865ebef40bf5e1>
>
> https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup
> <
> https://netgen.io/blog/keeping-apache-solr-up-and-running-on-ez-platform-setup
> >
> https://issues.apache.org/jira/browse/SOLR-14410 <
> https://issues.apache.org/jira/browse/SOLR-14410>
>
> Why have a cold backup and then switch? Every time I see that config, I
> wonder why people don’t have both servers live behind a load balancer. How
> do you know the cold server will work?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 8, 2020, at 9:20 AM, Dave  wrote:
> >
> > A simple Perl script would be able to cover this, I have a cron job Perl
> script that does a search with an expected result, if the result isn’t
> there it fails over to a backup search server, sends me an email, and I fix
> what’s wrong. The backup search server is a direct clone of the live server
> and just as strong, no interruption (aside from the five minute window)
> >
> > If you need a hand with this I’d gladly help, everything I run is Linux
> based but it’s a simple curl command and server switch on failure.
> >
> >> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
> >>
> >> Use the solution described by Walter. This allows you to automatically
> restart in case of failure and is also cleaner than defining a cronjob.
> Otherwise This would be another dependency one needs to keep in mind -
> means if there is an issue and someone does not know the system the person
> has to look at different places which never is good
> >>
> >>> Am 04.06.2020 um 18:36 schrieb Ryan W :
> >>>
> >>> Does anyone have a script that checks if solr is running and then
> starts it
> >>> if it isn't running?  Occasionally my solr stops running even if there
> has
> >>> been no Apache restart.  I haven't been able to determine the root
> cause,
> >>> so the next best thing might be to check every 15 minutes or so if it's
> >>> running and run it if it has stopped.
> >>>
> >>> Thanks.
>
>


Re: What is the logical order of applying sorts in SOLR?

2020-05-16 Thread David Hastings
the bq parameter; here's a Stack Overflow thread on it:

https://stackoverflow.com/questions/45150856/how-to-know-when-to-use-solr-bq-vs-bf-and-how-to-apply-query-logic
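
to make that concrete, with edismax the sort-by-id can usually be traded
for a boost that only influences ranking; a sketch using the field names
from the question below (created_date is a made-up example field):

    q=some query
    defType=edismax
    fq=Foo:Bar
    bq=Id:[1000 TO *]^10

or, for a graded "newer documents score higher" effect, a multiplicative
function boost such as boost=recip(ms(NOW,created_date),3.16e-11,1,1).
bq adds an additive bump for anything matching the boost query and boost
multiplies the score by the function; neither one restricts the result set
the way fq does.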

On Sat, May 16, 2020 at 6:27 PM Stephen Lewis Bianamara <
stephen.bianam...@gmail.com> wrote:

> Hi Paras,
>
> I'm not sure I follow. How would one replace sort by id with boosting?
>
> Thanks,
> Stephen
>
> On Fri, May 15, 2020, 7:43 AM Paras Lehana
>  wrote:
>
> > As a workaround, can you try field boosting?
> >
> > On Tue, 12 May 2020 at 00:45, Stephen Lewis Bianamara <
> > stephen.bianam...@gmail.com> wrote:
> >
> > > Hi SOLR Community,
> > >
> > > What is the order of operations which SOLR applies to sorting? I've
> > > observed many times and across SOLR versions that a restrictive filter
> > with
> > > a sort takes an extremely long time to return, suggesting to me that
> the
> > > SORT is applied before the filter.
> > >
> > > An example situation is querying for fq:Foo=Bar vs querying for
> > fq:Foo=Bar
> > > sort by Id desc. I've observed over many SOLR versions and collections
> > that
> > > the former is orders of magnitude cheaper and quicker to respond, even
> > when
> > > the result set is tiny (10-100).
> > >
> > > Does anyone in this forum know whether this is the default behavior and
> > > whether there is any way through the API or SOLR configuration to apply
> > > sorts after filters?
> > >
> > > Thanks,
> > > Stephen
> > >
> >
> >
> > --
> > --
> > Regards,
> >
> > *Paras Lehana* [65871]
> > Development Engineer, *Auto-Suggest*,
> > IndiaMART InterMESH Ltd,
> >
> > 11th Floor, Tower 2, Assotech Business Cresterra,
> > Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305
> >
> > Mob.: +91-9560911996
> > Work: 0120-4056700 | Extn:
> > *1196*
> >
> > --
>


Re: Stopwords impact on search

2020-04-24 Thread David Hastings
you should never use the stopword filter unless you have a very specific
purpose

On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:

> Hi everyone,
>
> What is, if any, the impact of stopwords in to my search ranking quality?
> Will my ranking improve is I do not index stopwords?
>
> I'm trying to figure out if I should use the stopword filter or not.
>
> Thanks in advanced.
>
> Steve
>


Re: Solr index size has increased in solr 7.7.2

2020-04-15 Thread David Hastings
I wouldn't worry about the index size until you get above a half terabyte or
so.  Adding docValues and other features means you sacrifice things that
don't matter, like size.  Memory and SSDs are cheap.

On Wed, Apr 15, 2020 at 1:21 PM Rajdeep Sahoo 
wrote:

> Hi all
> We are migrating from solr 4.6 to solr 7.7.2.
> In solr 4.6 the size was 2.5 gb but here in solr 7.7.2 the solr index size
> is showing 6.8 gb with the same no of documents. Is it expected behavior or
> any suggestions how to optimize the size.
>


Re: How do *you* restrict access to Solr?

2020-03-16 Thread David Hastings
master/slave is the idea that you have an indexing server you do all
indexing to and a search server that replicates the index to deliver the
results, etc.  if you keep the indexer separate you can tune it differently
as well as protect the data.  it also means you can remove the delete/update
request handlers from the slave/searcher.

yes, security by obscurity isn't ideal, but I find the overhead of adding
authentication to requests unnecessary.
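
if all that's wanted is an IP whitelist (as asked below), it doesn't need
anything inside Solr at all; a sketch at the OS level, assuming the default
8983 port and a 10.0.0.0/24 internal network:

    # allow the internal network to reach Solr, drop everyone else
    iptables -A INPUT -p tcp --dport 8983 -s 10.0.0.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP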

On Mon, Mar 16, 2020 at 11:16 AM Ryan W  wrote:

> On Mon, Mar 16, 2020 at 11:09 AM Walter Underwood 
> wrote:
>
> > What access do you want to prevent? How do you prefer to authenticate?
> > How do you manage users or roles? Master/slave or Solr Cloud?
> >
>
> I want to prevent access to the admin UI.
>
> I don't want to manage users or roles, preferably.  I have only one user:
> staff.  I want to prevent the public from accessing the admin UI.  I'd be
> happy if I could set an IP address whitelist... especially if I don't have
> to learn a new framework (which I will never use for any other purpose) to
> do it.
>
> I don't know what master/slave is.  These are new concepts that weren't
> required to secure Solr prior to 7x, and this is my first project using a
> version after 6x.
>
> Thanks!
>
>
>
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Mar 16, 2020, at 7:44 AM, Ryan W  wrote:
> > >
> > > How do you, personally, do it?  Do you use IPTables?  Basic
> > Authentication
> > > Plugin? Something else?
> > >
> > > I'm asking in part so I'l have something to search for.  I don't know
> > where
> > > I should begin, so I figured I would ask how others do it.
> > >
> > > I haven't been able to find anything that works, so if you can tell me
> > what
> > > works for you, I can at least narrow it down a bit and do some Google
> > > searches.  Do I need to learn Solr's plugin system?  Am I starting in
> the
> > > right place if I follow this document:
> > >
> >
> https://lucene.apache.org/solr/guide/7_0/rule-based-authorization-plugin.html#rule-based-authorization-plugin
> > >
> > > Initially, the above document seems far too comprehensive for my needs.
> > I
> > > just want to block access to the Solr admin UI, and the list of
> > predefined
> > > permissions in that document don't seem to be relevant.  Also, it seems
> > > unlikely this plugin system is necessary just to control access to the
> > > admin UI... or maybe it necessary?
> > >
> > > In any case, what is your approach?
> > >
> > > I'm using version 7.7.2 of Solr.
> > >
> > > Thanks!
> >
> >
>


Re: How do *you* restrict access to Solr?

2020-03-16 Thread David Hastings
Honestly?  I know this isn't what you're going to want to hear, but security
through obscurity.  no one else knows what port the server is on, and it's
not accessible from anything outside of the internal network.  if your Solr
install can be accessed from an external IP you have much larger issues.

On Mon, Mar 16, 2020 at 10:44 AM Ryan W  wrote:

> How do you, personally, do it?  Do you use IPTables?  Basic Authentication
> Plugin? Something else?
>
> I'm asking in part so I'l have something to search for.  I don't know where
> I should begin, so I figured I would ask how others do it.
>
> I haven't been able to find anything that works, so if you can tell me what
> works for you, I can at least narrow it down a bit and do some Google
> searches.  Do I need to learn Solr's plugin system?  Am I starting in the
> right place if I follow this document:
>
> https://lucene.apache.org/solr/guide/7_0/rule-based-authorization-plugin.html#rule-based-authorization-plugin
>
> Initially, the above document seems far too comprehensive for my needs.  I
> just want to block access to the Solr admin UI, and the list of predefined
> permissions in that document don't seem to be relevant.  Also, it seems
> unlikely this plugin system is necessary just to control access to the
> admin UI... or maybe it necessary?
>
> In any case, what is your approach?
>
> I'm using version 7.7.2 of Solr.
>
> Thanks!
>


Re: [SUSPICIOUS] Re: Best Practises around relevance tuning per query

2020-02-18 Thread David Hastings
I don't think anyone is responding because it's too focused a use case,
where you simply have to figure out an alternative on your own.
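
that said, for anyone who lands on this thread later: the re-ranker
mentioned below can take the per-document weights without boosting every
document in the main query; a sketch (ids and weights are placeholders):

    q=dog&fq=featureA:foo&fq=featureB:bar
    &rq={!rerank reRankQuery=$rqq reRankDocs=500 reRankWeight=1.0}
    &rqq=(id:ID-1^500 OR id:ID-2^499 OR id:ID-500^1)

only the top reRankDocs results of the original query get re-scored, so the
expensive boost query never runs against the full result set.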

> On Feb 19, 2020, at 12:28 AM, Ashwin Ramesh  wrote:
> 
> ping on this :)
> 
>> On Tue, Feb 18, 2020 at 11:50 AM Ashwin Ramesh  wrote:
>> 
>> Hi,
>> 
>> We are in the process of applying a scoring model to our search results.
>> In particular, we would like to add scores for documents per query and user
>> context.
>> 
>> For example, we want to have a score from 500 to 1 for the top 500
>> documents for the query “dog” for users who speak US English.
>> 
>> We believe it becomes infeasible to store these scores in Solr because we
>> want to update the scores regularly, and the number of scores increases
>> rapidly with increased user attributes.
>> 
>> One solution we explored was to store these scores in a secondary data
>> store, and use this at Solr query time with a boost function such as:
>> 
>> `bf=mul(termfreq(id,’ID-1'),500) mul(termfreq(id,'ID-2'),499) …
>> mul(termfreq(id,'ID-500'),1)`
>> 
>> We have over a hundred thousand documents in one Solr collection, and
>> about fifty million in another Solr collection. We have some queries for
>> which roughly 80% of the results match, although this is an edge case. We
>> wanted to know the worst case performance, so we tested with such a query.
>> For both of these collections we found the a message similar to the
>> following in the Solr cloud logs (tested on a laptop):
>> 
>> Elapsed time: 5020. Exceeded allowed search time: 5000 ms.
>> 
>> We then tried using the following boost, which seemed simpler:
>> 
>> `boost=if(query($qq), 10, 1)=id:(ID-1 OR ID-2 OR … OR ID-500)`
>> 
>> We then saw the following in the Solr cloud logs:
>> 
>> `The request took too long to iterate over terms.`
>> 
>> All responses above took over 5000 milliseconds to return.
>> 
>> We are considering Solr’s re-ranker, but I don’t know how we would use
>> this without pushing all the query-context-document scores to Solr.
>> 
>> 
>> The alternative solution that we are currently considering involves
>> invoking multiple solr queries.
>> 
>> This means we would make a request to solr to fetch the top N results (id,
>> score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB=bar, limit=N.
>> 
>> Another request would be made using a filter query with a set of doc ids
>> that we know are high value for the user’s query. E.g. q=*:*,
>> fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.
>> 
>> We would then do a reranking phase in our service layer.
>> 
>> Do you have any suggestions for known patterns of how we can store and
>> retrieve scores per user context and query?
>> 
>> Regards,
>> Ash & Spirit.
>> 
> 
> -- 


Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
interesting, I can't seem to find anything on Phrase IDF; don't suppose you
have a link or two I could look at by chance?

On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
wrote:

> At Infoseek, we used “glue words” to build phrase tokens. It was really
> effective.
> Phrase IDF is powerful stuff.
>
> Luckily for you, the patent on that has expired. :-)
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 10:46 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > i use stop words for building shingles into "interesting phrases" for my
> > machine teacher/students, so i wouldnt say theres no reason, however my
> use
> > case is very specific.  Otherwise yeah, theyre gone for all practical
> > reasons/search scenarios.
> >
> > On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> > wrote:
> >
> >> Why are you using stopwords? I would need a really, really good reason
> to
> >> use those.
> >>
> >> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> >> used them and
> >> I’ve been a search engineer since 1997.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> >> wrote:
> >>>
> >>> Hi
> >>>
> >>> I've run into an issue with creating a Managed Stopwords list that has
> >> the
> >>> same name as a previously deleted list. Going through the same flow
> with
> >>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
> >> missing
> >>> something or did I discover a bug in Solr?
> >>>
> >>> On a newly started solr with the techproducts core:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> The second PUT request results in a status 500 with error
> >>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >>>
> >>> Similar requests for synonyms work fine, no matter how many times I
> >> repeat
> >>> the CREATE/DELETE/RELOAD cycle:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl -X DELETE
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>>
> >>> Reloading after creating the Stopwords list but not after deleting it
> >> works
> >>> without error too on a fresh techproducts core (you'll have to remove
> the
> >>> directory from disk and create the core again after running the
> previous
> >>> commands).
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
I use stop words for building shingles into "interesting phrases" for my
machine teacher/students, so I wouldn't say there's no reason, however my use
case is very specific.  Otherwise yeah, they're gone for all practical
reasons/search scenarios.
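
for anyone curious what that looks like in a schema, a shingle field is
roughly this (a sketch, not the exact setup described above; names are made
up):

    <fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- no stopword filter here, so the stopwords survive into the shingles -->
        <filter class="solr.ShingleFilterFactory" minShingleSize="2"
                maxShingleSize="3" outputUnigrams="false"/>
      </analyzer>
    </fieldType>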

On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
wrote:

> Why are you using stopwords? I would need a really, really good reason to
> use those.
>
> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> used them and
> I’ve been a search engineer since 1997.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> wrote:
> >
> > Hi
> >
> > I've run into an issue with creating a Managed Stopwords list that has
> the
> > same name as a previously deleted list. Going through the same flow with
> > Managed Synonyms doesn't result in this unexpected behaviour. Am I
> missing
> > something or did I discover a bug in Solr?
> >
> > On a newly started solr with the techproducts core:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > The second PUT request results in a status 500 with error
> > msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >
> > Similar requests for synonyms work fine, no matter how many times I
> repeat
> > the CREATE/DELETE/RELOAD cycle:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >
> > Reloading after creating the Stopwords list but not after deleting it
> works
> > without error too on a fresh techproducts core (you'll have to remove the
> > directory from disk and create the core again after running the previous
> > commands).
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> > CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> > can be completed twice. (Again, on a freshly created techproducts core.)
> > Only the third attempt to create a list results in an error. Synonyms can
> > still be created and deleted repeatedly after this.
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >

Re: How to compute index size

2020-02-03 Thread David Hastings
Yup, I find the right calculation to be as much RAM as the server can take
and as much SSD space as it will hold; when you run out, buy another server
and repeat.  Machines/RAM/SSDs are cheap.  Just get as much as you can.
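
to answer the literal question, "index size" usually means size on disk,
and you can read it either from the filesystem or from the metrics API; a
sketch (core name and data path are assumptions):

    du -sh /var/solr/data/mycore/data/index

    curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes"

both report the same thing the replies below describe: the size of the
data/index directory.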

On Mon, Feb 3, 2020 at 11:59 AM Walter Underwood 
wrote:

> What he said.
>
> But if you must have a number, assume that the index will be as big as
> your (text) data. It might be 2X bigger or 2X smaller. Or 3X or 4X, but
> that is a starting point. Once you start updating, the index might get as
> much as 2X bigger before merges.
>
> Do NOT try to get by with the smallest possible RAM or disk.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 3, 2020, at 5:28 AM, Erick Erickson 
> wrote:
> >
> > I’ve always had trouble with that advice, that RAM size should be JVM +
> index size. I’ve seen 300G indexes (as measured by the size of the
> data/index directory) run in 128G of memory.
> >
> > Here’s the long form:
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > But the short form is “stress test and see”.
> >
> > To answer your question, though, when people say “index size” they’re
> usually referring to the size on disk as I mentioned above.
> >
> > Best,
> > Erick
> >
> >> On Feb 3, 2020, at 4:24 AM, Mohammed Farhan Ejaz 
> wrote:
> >>
> >> Hello All,
> >>
> >> I want to size the RAM for my Solr cloud instance. The thumb rule is
> your
> >> total RAM size should be = (JVM size + index size)
> >>
> >> Now I have a simple question, How do I know my index size? A simple
> method,
> >> perhaps from the Solr cloud admin UI or an API?
> >>
> >> My assumption so far is the total segment info size is the same as the
> >> index size.
> >>
> >> Thanks & Regards
> >> Farhan
> >
>
>


Re: Easiest way to export the entire index

2020-01-29 Thread David Hastings
I do this often and just create a 30GB file using wget.
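
for anything bigger, or if a single giant request times out, the cursor
approach Emir describes below is basically this loop (collection name and
page size are placeholders; the sort has to include the uniqueKey):

    # first page
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&wt=json&rows=1000&sort=id+asc&cursorMark=*"

    # then resend with cursorMark set to the nextCursorMark from each response,
    # and stop when nextCursorMark no longer changes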

On Wed, Jan 29, 2020 at 10:21 AM Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Amanda,
> I assume that you have all the fields stored so you will be able to export
> full document.
>
> Several thousands records should not be too much to use regular start+rows
> to paginate results, but the proper way of doing that would be to use
> cursors. Adjust page size to avoid creating huge responses and you can use
> curl or some similar tool to avoid using admin console. I did a quick
> search and there are several blog posts with scripts that does what you
> need.
>
> HTH,
> Emir
>
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 29 Jan 2020, at 15:43, Amanda Shuman  wrote:
> >
> > Dear all:
> >
> > I've been asked to produce a JSON file of our index so it can be combined
> > and indexed with other records. (We run solr 5.3.1 on this project; we're
> > not going to upgrade, in part because funding has ended.) The index has
> > several thousand rows, but nothing too drastic. Unfortunately, this is
> too
> > much to handle for a simple query dump from the admin console. I tried to
> > follow instructions related to running /export directly but I guess the
> > export handler isn't installed. I tried to divide the query into rows,
> but
> > after a certain amount it freezes, and it also freezes when I try to
> limit
> > rows (e.g., rows 501-551 freezes the console). Is there any other way to
> > export the index short of having to install the export handler
> considering
> > we're not working on this project anyone?
> >
> > Thanks,
> > Amanda
> >
> > --
> > Dr. Amanda Shuman
> > Researcher and Lecturer, Institute of Chinese Studies, University of
> > Freiburg
> > Coordinator for the MA program in Modern China Studies
> > Database Administrator, The Maoist Legacy 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 96748
>
>


Re: How to negate numeric range query - or - how to get records NOT matching a certain numeric range

2020-01-24 Thread David Hastings
just tried  "fq":"NOT year:[1900 TO 2000]"  on my data set and it also
worked as expected.  Mind if I ask why there are ()'s around your range
query:
(u_lastLendingDate_combined_ls_ns:([8610134693 TO 8611935823]))
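
one note for the archives, since the parentheses can matter: Solr rewrites
a purely negative filter for you at the top level, but once the negation is
nested inside a larger boolean expression it needs an explicit *:* to
subtract from; for example (using the fields from the question):

    fq=NOT u_lastLendingDate_combined_ls_ns:[8610134693 TO 8611935823]
    fq=tenant_id:1 AND (*:* NOT u_lastLendingDate_combined_ls_ns:[8610134693 TO 8611935823])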

On Fri, Jan 24, 2020 at 11:01 AM David Hastings <
hastings.recurs...@gmail.com> wrote:

> having
> fq=NOT field:value
> works for me,
>
> On Fri, Jan 24, 2020 at 10:56 AM Sebastian Riemer 
> wrote:
>
>> Hi all!
>>
>>
>>
>> Consider a query containing fq-params like this:
>>
>>
>>
>> "*fq*":["tenant_id:1",
>>
>> "u_markedAsDeleted_b:false",
>>
>> "u_id_s:[* TO *]",
>>
>> "(u_lastLendingDate_combined_ls_ns:([8610134693 TO 8611935823]))"
>> ]
>>
>>
>>
>> This gives me a list of users, having a last lending date (somewhat
>> encoded as long) in that given numeric range.
>>
>>
>>
>> Now, I’d like to get a list of users, **NOT** having a last lending in
>> that given numeric range.
>>
>>
>>
>> I’ve tried adding NOT and ! to the respective fq-query-part without
>> success.
>>
>>
>>
>> Additional info: the field is of type *long (TrieLongField)* and it is 
>> *multiValued**="true"*
>>
>>
>>
>> An example of the full query-string would be:
>>
>>
>>
>> start=0=50=tenant_id:1=u_markedAsDeleted_b:false=u_id_s:[*
>> TO
>> *]=*:*=true=true=count=1=u_userName_cp_s
>> desc=u_userName_cp_s^20 u_displayName_cp_s^20  text^2 text_en text_de
>> text_it=u_userName_cp_s^100 u_displayName_cp_s^20  text^10=100%
>>
>>
>>
>> Thank you for your input and a nice weekend to all of you!
>>
>>
>>
>> Please let me know if I did not share vital details!
>>
>>
>>
>> Mit freundlichen Grüßen
>>
>> Sebastian Riemer, BSc
>>
>>
>>
>>
>>
>> [image: logo_Littera_SC] <http://www.littera.eu/>
>> LITTERA Software & Consulting GmbH
>>
>> A-6060 Hall i.T., Haller Au 19a
>>
>> Telefon: +43(0) 50 765 000, Fax: +43(0) 50 765 118
>>
>> Sitz: Hall i.T., eingetragen beim Handelsgericht Innsbruck,
>> Firmenbuch-Nr. FN 295807k, geschäftsführender Gesellschafter: Albert
>> Unterkircher
>>
>>
>>
>> D-80637 München, Landshuter Allee 8-10
>> Telefon: +49(0) 89 919 29 122, Fax: +49(0) 89 919 29 123
>>
>> Sitz: München, eingetragen beim Amtsgericht München
>> unter HRB 103698, Geschäftsführer: Albert Unterkircher
>>
>> E-Mail: off...@littera.eu
>> Homepage: www.littera.eu
>>
>>
>>
>> Diese Nachricht kann vertrauliche, nicht für die Veröffentlichung
>> bestimmte und/oder rechtlich geschützte Informationen enthalten. Falls Sie
>> nicht der beabsichtigte Empfänger sind, beachten Sie bitte, dass jegliche
>> Veröffentlichung, Verbreitung oder Vervielfältigung dieser Nachricht
>> strengstens untersagt ist. Sollten Sie diese Nachricht irrtümlich erhalten
>> haben, informieren Sie bitte sofort den Absender durch Anruf oder
>> Rücksendung der Nachricht und vernichten Sie diese.
>>
>> This communication may contain information that is legally privileged,
>> confidential or exempt from disclosure.  If you are not the intended
>> recipient, please note that any dissemination, distribution, or copying of
>> this communication is strictly prohibited.  Anyone who receives this
>> message in error should notify the sender immediately by telephone or by
>> return e-mail and delete this communication entirely from his or her
>> computer.
>>
>>
>>
>


Re: How to negate numeric range query - or - how to get records NOT matching a certain numeric range

2020-01-24 Thread David Hastings
having
fq=NOT field:value
works for me,

On Fri, Jan 24, 2020 at 10:56 AM Sebastian Riemer 
wrote:

> Hi all!
>
>
>
> Consider a query containing fq-params like this:
>
>
>
> "*fq*":["tenant_id:1",
>
> "u_markedAsDeleted_b:false",
>
> "u_id_s:[* TO *]",
>
> "(u_lastLendingDate_combined_ls_ns:([8610134693 TO 8611935823]))"]
>
>
>
> This gives me a list of users, having a last lending date (somewhat
> encoded as long) in that given numeric range.
>
>
>
> Now, I’d like to get a list of users, **NOT** having a last lending in
> that given numeric range.
>
>
>
> I’ve tried adding NOT and ! to the respective fq-query-part without
> success.
>
>
>
> Additional info: the field is of type *long (TrieLongField)* and it is 
> *multiValued**="true"*
>
>
>
> An example of the full query-string would be:
>
>
>
> start=0=50=tenant_id:1=u_markedAsDeleted_b:false=u_id_s:[*
> TO
> *]=*:*=true=true=count=1=u_userName_cp_s
> desc=u_userName_cp_s^20 u_displayName_cp_s^20  text^2 text_en text_de
> text_it=u_userName_cp_s^100 u_displayName_cp_s^20  text^10=100%
>
>
>
> Thank you for your input and a nice weekend to all of you!
>
>
>
> Please let me know if I did not share vital details!
>
>
>
> Mit freundlichen Grüßen
>
> Sebastian Riemer, BSc
>
>
>
>
>
> [image: logo_Littera_SC] 
> LITTERA Software & Consulting GmbH
>
> A-6060 Hall i.T., Haller Au 19a
>
> Telefon: +43(0) 50 765 000, Fax: +43(0) 50 765 118
>
> Sitz: Hall i.T., eingetragen beim Handelsgericht Innsbruck,
> Firmenbuch-Nr. FN 295807k, geschäftsführender Gesellschafter: Albert
> Unterkircher
>
>
>
> D-80637 München, Landshuter Allee 8-10
> Telefon: +49(0) 89 919 29 122, Fax: +49(0) 89 919 29 123
>
> Sitz: München, eingetragen beim Amtsgericht München
> unter HRB 103698, Geschäftsführer: Albert Unterkircher
>
> E-Mail: off...@littera.eu
> Homepage: www.littera.eu
>
>
>
> Diese Nachricht kann vertrauliche, nicht für die Veröffentlichung
> bestimmte und/oder rechtlich geschützte Informationen enthalten. Falls Sie
> nicht der beabsichtigte Empfänger sind, beachten Sie bitte, dass jegliche
> Veröffentlichung, Verbreitung oder Vervielfältigung dieser Nachricht
> strengstens untersagt ist. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, informieren Sie bitte sofort den Absender durch Anruf oder
> Rücksendung der Nachricht und vernichten Sie diese.
>
> This communication may contain information that is legally privileged,
> confidential or exempt from disclosure.  If you are not the intended
> recipient, please note that any dissemination, distribution, or copying of
> this communication is strictly prohibited.  Anyone who receives this
> message in error should notify the sender immediately by telephone or by
> return e-mail and delete this communication entirely from his or her
> computer.
>
>
>


Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread David Hastings
This is a really cool idea!  My only concern is that the edge case
searches, where a user knows exactly what they want to find, would be
autocompleted into something that happens to be more "successful" rather
than what they were looking for.  For example, I want to know the legal
implications of Jay-Z's 99 Problems.  Most of the autocompletes I imagine
would be for the lyrics of the song, or links to the video or Jay-Z
himself, when what I'm looking for is a line-by-line analysis of the song
itself and how it relates to the Fourth Amendment:
http://pdf.textfiles.com/academics/lj56-2_mason_article.pdf

But in general this is a really clever idea, especially in the retail
arena.  However, I suspect your use case is more in research, and after
years of dealing with lawyers and librarians, they tend to not like having
their searches intercepted; they know what they're looking for and they
tend to get mad if you assume they don't :)

On Fri, Jan 24, 2020 at 9:59 AM Lucky Sharma  wrote:

> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection and
> You can instead of adding weights inthe document you can also use LTR with
> in Solr to rerank on the features.
>
> Regards,
> Lucky Sharma
>
> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
> audrey.lorberf...@ibm.com,
>  wrote:
>
> > Erik,
> >
> > Thank you! Yes, that's exactly how we were thinking of architecting it.
> > And our ML engineer suggested something else for the suggestion weights,
> > actually -- to build a model that would programmatically update the
> weights
> > based on those suggestions' live clicks @ position k, etc. Pretty cool
> > idea...
> >
> >
> >
> > On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
> >
> > It's a great idea.   And then index that file into a separate lean
> > collection of just the suggestions, along with the weight as another
> field
> > on those documents, to use for ranking them at query time with standard
> > /select queries.  (this separate suggest collection would also have
> > appropriate tokenization to match the partial words as the user types,
> like
> > ngramming)
> >
> > Erik
> >
> >
> > > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > David,
> > >
> > > Thank you, that is useful. So, would you recommend using a (clean)
> > field over an external dictionary file? We have lots of "top queries" and
> > measure their nDCG. A thought was to programmatically generate an
> external
> > file where the weight per query term (or phrase) == its nDCG. Bad idea?
> > >
> > > Best,
> > > Audrey
> > >
> > > On 1/20/20, 11:51 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Ive used this quite a bit, my biggest piece of advice is to
> > choose a field
> > >that you know is clean, with well defined terms/words, you dont
> > want an
> > >autocomplete that has a massive dictionary, also it will make
> the
> > >start/reload times pretty slow
> > >
> > >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > >audrey.lorberf...@ibm.com  wrote:
> > >
> > >> Hi All,
> > >>
> > >> We plan to incorporate a query autocomplete functionality into our
> > search
> > >> engine (like this:
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > >> ). And I was wondering if anyone has personal experience with this
> > >> component and would like to share? Basically, we are just looking
> > for some
> > >> best practices from more experienced Solr admins so that we have a
> > starting
> > >> place to launch this in our beta.
> > >>
> > >> Thank you!
> > >>
> > >> Best,
> > >> Audrey
> > >>
> > >
> > >
> >
> >
> >
> >
>


Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread David Hastings
Not a bad idea at all; however, I've never used an external file before, just
a field in the index, so it's not an area I'm familiar with.

On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> David,
>
> Thank you, that is useful. So, would you recommend using a (clean) field
> over an external dictionary file? We have lots of "top queries" and measure
> their nDCG. A thought was to programmatically generate an external file
> where the weight per query term (or phrase) == its nDCG. Bad idea?
>
> Best,
> Audrey
>
> On 1/20/20, 11:51 AM, "David Hastings" 
> wrote:
>
> Ive used this quite a bit, my biggest piece of advice is to choose a
> field
> that you know is clean, with well defined terms/words, you dont want an
> autocomplete that has a massive dictionary, also it will make the
> start/reload times pretty slow
>
> On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Hi All,
> >
> > We plan to incorporate a query autocomplete functionality into our
> search
> > engine (like this:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > ). And I was wondering if anyone has personal experience with this
> > component and would like to share? Basically, we are just looking
> for some
> > best practices from more experienced Solr admins so that we have a
> starting
> > place to launch this in our beta.
> >
> > Thank you!
> >
> > Best,
> > Audrey
> >
>
>
>


Re: Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread David Hastings
I've used this quite a bit.  My biggest piece of advice is to choose a field
that you know is clean, with well defined terms/words.  You don't want an
autocomplete that has a massive dictionary; it will also make the
start/reload times pretty slow.

On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> We plan to incorporate a query autocomplete functionality into our search
> engine (like this: https://lucene.apache.org/solr/guide/8_1/suggester.html
> ). And I was wondering if anyone has personal experience with this
> component and would like to share? Basically, we are just looking for some
> best practices from more experienced Solr admins so that we have a starting
> place to launch this in our beta.
>
> Thank you!
>
> Best,
> Audrey
>


Re: Failed to connect to server

2020-01-17 Thread David Hastings
something like this in your solr config:

 autosuggest false text 0.005 
DocumentDictionaryFactory title weight true true 
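
(a suggest component along the lines of the values above would look roughly
like this; the lookupImpl and build flags here are assumptions rather than
what the original config used:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">autosuggest</str>
        <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">title</str>
        <str name="weightField">weight</str>
        <str name="suggestAnalyzerFieldType">text</str>
        <str name="buildOnStartup">false</str>
      </lst>
    </searchComponent>

a suggester that builds on startup or on commit over a big dictionary can
be one reason a core sits in "Solr core is loading" for a while.)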





On Fri, Jan 17, 2020 at 12:02 PM rhys J  wrote:

> On Thu, Jan 16, 2020 at 3:48 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > >  'Error: Solr core is loading'
> >
> > do you have any suggesters or anything configured that would get rebuilt?
> >
> >
> > I don't think so? But I'm not quite sure what you are asking?
> >
>
> Rhys
>


Re: Failed to connect to server

2020-01-16 Thread David Hastings
>  'Error: Solr core is loading'

do you have any suggesters or anything configured that would get rebuilt?



On Thu, Jan 16, 2020 at 3:41 PM rhys J  wrote:

> On Thu, Jan 16, 2020 at 3:27 PM Edward Ribeiro 
> wrote:
>
> > A regular update is a delete followed by an indexing of the document. So
> > technically both are indexes. :) If there's an atomic update (
> >
> https://lucene.apache.org/solr/guide/8_4/updating-parts-of-documents.html
> > ), Solr would throw some sort of version conflict exception like
> >
> >
> These would have been atomic updates running at the same time I was
> importing a csv file into another core.
>
> After the connection errors, I noticed in the log that there was an error
> from a curl statement that said 'Error: Solr core is loading'
>
> The connection refused exception does not seem related to the indexing by
> > itself. Maybe it has to do with you hitting the maximum connection
> requests
> > allowed per host. See in the link below the maxConnectionsPerHost and
> > maxConnections parameters of your Solr version:
> >
> >
> >
> https://lucene.apache.org/solr/guide/6_6/format-of-solr-xml.html#Formatofsolr.xml-The%3CshardHandlerFactory%3Eelement
> >
> >
> Thank you for this. This was helpful. I have increased the number of
> maxConnections to see if this fixes the problem.
>
> Rhys
>


Re: SolrCloud upgrade concern

2020-01-16 Thread David Hastings
ha, I'm on that thread, didn't know they got stored on a site, that's good to
know!

I stand by what I said in there, so I have nothing more to add.

On Thu, Jan 16, 2020 at 3:29 PM Arnold Bronley 
wrote:

> Hi,
>
> I am trying to upgrade my system from Solr master-slave architecture to
> SolrCloud architecture.
> Meanwhile, I stumbled upon this very negative post about SolrCloud.
>
>
> https://lucene.472066.n3.nabble.com/A-Last-Message-to-the-Solr-Users-td4452980.html
>
>
> Given that it is from one of the initial authors of SolrCloud
> functionality, I am having second thoughts about the upgrade and I am
> somewhat concerned.
>
> I will greatly appreciate any advice/feedback on this from Solr community.
>


Re: does copyFields increase indexe size ?

2019-12-26 Thread David Hastings
The field is stored somewhere 

> On Dec 26, 2019, at 3:22 PM, Nicolas Paris  wrote:
> 
> Hi Eric
> 
> Below a part of the managed-schema. There is 1k section* fields. The
> second experience, I removed the copyField, droped the collection and
> re-indexed the whole. To mesure the index size, I went to solr-cloud and
> looked in the cloud part: 40GO per shard. I also look at the folder
> size. I made some tests and the _text_ field is indexed.
> 
> multiValued="true"/> 
> multiValued="true"/>
>
> 
> positionIncrementGap="100">
> 
>
>  
> 
> replacement=" " replace="all"/>
>  
>
> articles="lang/contractions_fr.txt"/>
>
> words="lang/stopwords_fr.txt" format="snowball" />
>
>  
>  
>  
> synonyms="synonyms-fr.txt" ignoreCase="true" expand="true"/>
> replacement=" " replace="all"/>
>  
>
> articles="lang/contractions_fr.txt"/>
>
> words="lang/stopwords_fr.txt" format="snowball" />
>
>  
>
> 
> 
> 
> 
> 
>> On Thu, Dec 26, 2019 at 02:16:32PM -0500, Erick Erickson wrote:
>> This simply cannot be true unless the destination copyField is 
>> indexed=false, docValues=false stored=false. I.e. “some circumstances” means 
>> there’s really no use in using the copyField in the first place. I suppose 
>> that if you don’t store any term vectors, no position information nothing 
>> except, say, the terms then maybe you’ll have extremely minimal size. But 
>> even in that case, I’d use the original field in an “fq” clause which 
>> doesn’t use any scoring in place of using the copyField.
>> 
>> Each field is stored in a separate part of the relevant files (.tim, .pos, 
>> etc). Term frequencies are kept on a _per field_ basis for instance.
>> 
>> So this pretty much has to be small sample size or other measurement error.
>> 
>> Best,
>> Erick
>> 
 On Dec 26, 2019, at 9:27 AM, Nicolas Paris  
 wrote:
>>> 
>>> Anyway, that´s good news copy field does not increase indexe size in
>>> some circumstance:
>>> - the copied fields and the target field share the same datatype
>>> - the target field is not stored
>>> 
>>> this is tested on text fields
>>> 
>>> 
>>> On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote:
 
 On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote:
> #2 you initially said you were talking about 1k documents. 
 
 Hi Dave. Again, sorry for the confusion. This is 1k fields
 (general_text), over 50M large  documents copied into one _text_ field. 
 4 shards, 40GB per shard in both case, with/without the _text_ field
 
> 
>> On Dec 25, 2019, at 3:07 AM, Nicolas Paris  
>> wrote:
>> 
>> 
>>> 
>>> If you are redoing the indexing after changing the schema and
>>> reloading/restarting, then you can ignore me.
>> 
>> I am sorry to say that I have to ignore you. Indeed, my tests include
>> recreating the collection from scratch - with and without the copy
>> fields.
>> In both cases the index size is the same ! (while the _text_ field is
>> working correctly)
>> 
>>> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
 On 12/24/2019 5:11 PM, Nicolas Paris wrote:
 Do you mean "copy fields" is only an action of changing the schema ?
 I was thinking it was adding a new field and eventually a new index to
 the collection
>>> 
>>> The copy that copyField does happens at index time.  Reindexing is 
>>> required
>>> after changing the schema, or nothing happens.
>>> 
>>> If you are redoing the indexing after changing the schema and
>>> reloading/restarting, then you can ignore me.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 
>> -- 
>> nicolas
> 
 
 -- 
 nicolas
 
>>> 
>>> -- 
>>> nicolas
>> 
> 
> -- 
> nicolas


Re: Help for importing large data (approx. 8GB) from old solr version to new solr version

2019-12-25 Thread David Hastings
Exactly. Although I'm a bit curious why you're going a .1 version up; I always
wait until an x2, so I won't be upgrading until 9.3

> On Dec 25, 2019, at 9:45 AM, Erick Erickson  wrote:
> 
> Should work. At any rate, just try it. Since all you’re doing is copying 
> data, even if the new installation doesn’t work you still have the original.
> 
>> On Dec 25, 2019, at 1:35 AM, Ken Walker  wrote:
>> 
>> Hello Erick,
>> 
>> Thanks for your reply!
>> 
>> You mean that, we should follow below steps right?
>> Here is the data directory path :
>> solr/solr-8.2.0/server/solr/product/item_core/data
>> 
>> STEPS :-
>> 1. Stop old solr-8.2.0 server
>> 2. Copy data directory (from old solr version to new solr version)
>> copy solr/solr-8.2.1/server/solr/product/item_core/data to
>> solr/solr-8.3.1/server/solr/product/item_core/data
>> 3. Start new solr version solr-8.3.1
>> 
>> Is it correct way to copy just index only from old to new solr version?
>> Is it lose any data or anything break in new solr version ?
>> 
>> Thanks in advance!
>> -Ken
> 


Re: xms/xmx choices

2019-12-09 Thread David Hastings
in case anyone is interested, I made the memory changes as well as two
changes:
XX:ParallelGCThreads  8 -> 20
XX:ConcGCThreads      4 -> 5

old:
https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMTkvMTIvNi8tLXNvbHJfZ2MubG9nLjAuY3VycmVudC0tMTQtMjEtMTA==WEB

now:
https://gceasy.io/diamondgc-report.jsp?p=c2hhcmVkLzIwMTkvMTIvOS8tLXNvbHJfZ2MubG9nLjAuY3VycmVudC0tMTQtMS02=WEB

however, there hasn't really been anything noticeable as far as Solr itself
is concerned when it comes to qtimes.
pre JVM changes:
 43963 searches
Complete SOLR average : 5.33 / 10th seconds for SOLR
Raw SOLR over 1/1000 secs : 208, 0.47%
Raw SOLR over 1000/1000 secs : 5261, 11.97%

post JVM changes:
 28369 searches
Complete SOLR average : 4.77 / 10th seconds for SOLR
Raw SOLR over 1/1000 secs : 94, 0.33%
Raw SOLR over 1000/1000 secs : 3583, 12.63%
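
for reference, these settings end up in solr.in.sh; a sketch with the
values from this thread (note that setting GC_TUNE replaces Solr's default
GC flags, so carry over anything else you rely on):

    SOLR_HEAP="31g"
    GC_TUNE="-XX:ParallelGCThreads=20 -XX:ConcGCThreads=5"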




On Fri, Dec 6, 2019 at 9:39 AM David Hastings 
wrote:

> Thanks you guys, this has been educational, i uploaded up to now, the
> server was restarted after adding the extra memory, so
>
> https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTIvNi8tLXNvbHJfZ2MubG9nLjAuY3VycmVudC0tMTQtMjEtMTA==WEB
>
> is what im looking at.  tuning the JVM is new to me, so im just going by
> what ive researched and what this site is saying.
> from what i can tell:
>   the peak looks like 31gb would be perfect, will implement that today
>   throughput is seems good, assuming gceasy recommendation of above 95% is
> the target and im at 99.6
>   latency looks like its as good as I really care to get, who really cares
> about 200ms
>   as far as heap after a GC it looks like it recovered well? or am i
> missing something?  the red spikes of a full GC like 28gb, and right after
> its down to 14gb
>
> I really appreciate this input, its educational/helpful
> -Dave
>
>
>
>
>
> On Fri, Dec 6, 2019 at 7:48 AM Erick Erickson 
> wrote:
>
>> A replication shouldn’t have consumed that much heap. It’s mostly I/O,
>> just a write through. If replication really consumes huge amounts of heap
>> we need to look at that more closely. Personally I suspect/hope it’s
>> coincidental, but that’s only a guess. You can attach jconsole to the
>> running process and monitor heap usage in real-time, jconsole is part of
>> the JDK so should be relatively easy to install. It has a nifty “gc now”
>> button that you can use to see if the heap you’re accumulating is just
>> garbage or really accumulates…
>>
>> And if this really is related to replication and that much heap is
>> actually used, we need to figure out why. Shawn’s observation that there is
>> very little heap recovered is worrying.
>>
>> Best,
>> Erick
>>
>> > On Dec 6, 2019, at 7:37 AM, Dave  wrote:
>> >
>> > Actually at about that time the replication finished and added about
>> 20-30gb to the index from the master.  My current set up goes
>> > Indexing master -> indexer slave/production master (only replicated on
>> command)-> three search slaves (replicate each 15 minutes)
>> >
>> > We added about 2.3m docs, then I replicated it to the production master
>> and since there was a change it replicated out to the slave node the gc
>> came from
>> >
>> > I’ll set one of the slaves to 31/31 and force all load to that one and
>> see how she does. Thanks!
>> >
>> >
>> >> On Dec 6, 2019, at 1:02 AM, Shawn Heisey  wrote:
>> >>
>> >> On 12/5/2019 12:57 PM, David Hastings wrote:
>> >>> That probably isnt enough data, so if youre interested:
>> >>> https://gofile.io/?c=rZQ2y4
>> >>
>> >> The previous one was less than 4 minutes, so it doesn't reveal
>> anything useful.
>> >>
>> >> This one is a little bit less than two hours.  That's more useful, but
>> still pretty short.
>> >>
>> >> Here's the "heap after GC" graph from the larger file:
>> >>
>> >>
>> https://www.dropbox.com/s/q9hs8fl0gfkfqi1/david.hastings.gc.graph.2019.12.png?dl=0
>> >>
>> >> At around 14:15, the heap usage was rather high. It got up over 25GB.
>> There were some very long GCs right at that time, which probably means they
>> were full GCs.  And they didn't free up any significant amount of memory.
>> So I'm betting that sometimes you actually *do* need a big chunk of that
>> 60GB of heap.  You might try reducing it to 31g instead of 6m.  Java's
>> memory usage is a lot more efficient if the max heap size is less than 32
>> GB.
>> >>
>> >> I can't give you any information about what happened at that time
>> which required so much heap.  You could see if you have logfiles that cover
>> that timeframe.
>> >>
>> >> Thanks,
>> >> Shawn
>>
>>


Re: Search returning unexpected matches at the top

2019-12-06 Thread David Hastings
what's the field type for
clt_ref_no?
*_no isn't a default dynamic field suffix, and owl-2924-8 usually gets tokenized into
owl 2924 8
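
if the goal is for the exact reference number to float to the top, one
option is to copy it into a field that only lowercases instead of
tokenizing, and boost matches on that; a sketch (field and type names are
made up):

    <fieldType name="string_lc" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="clt_ref_no_exact" type="string_lc" indexed="true" stored="false"/>
    <copyField source="clt_ref_no" dest="clt_ref_no_exact"/>

then query something like  clt_ref_no_exact:"owl-2924-8"^100 OR clt_ref_no:owl-2924-8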




David J. Hastings | Lead Developer
dhasti...@wshein.com | 716.882.2600 x 176

William S. Hein & Co., Inc.
2350 North Forest Road | Getzville, NY 14068
www.wshein.com/contact-us


From: Alexandre Rafalovitch 
Sent: Friday, December 6, 2019 11:15 AM
To: solr-user
Subject: Re: Search returning unexpected matches at the top

You can enable debug which will show you what matches and why. Check
the reference guide for parameters:
https://lucene.apache.org/solr/guide/8_1/common-query-parameters.html#debug-parameter

Regards,
   Alex.

On Fri, 6 Dec 2019 at 11:00, rhys J  wrote:
>
> I have a search box that is just searching every possible core, and every
> possible field.
>
> When I enter 'owl-2924-8', I expect the clt_ref_no of OWL-2924-8 to float
> to the top, however it is the third result in my list.
>
> Here is the code from the search:
>
> on_data({
>   "responseHeader":{
> "status":0,
> "QTime":31,
> "params":{
>   "hl":"true",
>   "indent":"on",
>   "fl":"debt_id, clt_ref_no",
>   "start":"0",
>   "sort":"score desc, id asc",
>   "rows":"500",
>   "version":"2.2",
>   "q":"clt_ref_no:owl\\-2924\\-8 debt_descr:owl\\-2924\\-8
> comments:owl\\-2924\\-8 reference_no:owl\\-2924\\-8 ",
>   "core":"debt",
>   "json.wrf":"on_data",
>   "urlquery":"owl-2924-8",
>   "callback":"?",
>   "wt":"json"}},
>   "response":{"numFound":85675,"start":0,"docs":[
>   {
> "clt_ref_no":"2924",
> "debt_id":"574574"},
>   {
> "clt_ref_no":"2924",
> "debt_id":"598663"},
>   {
> "clt_ref_no":"OWL-2924-8",
> "debt_id":"624401"},
>   {
> "clt_ref_no":"OWL-2924-8",
> "debt_id":"628157"},
>   {
> "clt_ref_no":"2924",
> "debt_id":"584807"},
>   {
> "clt_ref_no":"U615-2924-8",
> "debt_id":"628310"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"596713"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"624401"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"628157"},
>   {
>
> I'm not interested in having a specific search with quotes around it,
> because this is searching everything, so it's a fuzzy search. But I am
> interested in understanding why 'owl-2924-8' doesn't come out on top of the
> search.
>
> As you can see, I'm sorting by score and then id, which should take care of
> things, but it's not.
>
> Thanks,
>
> Rhys


Re: xms/xmx choices

2019-12-06 Thread David Hastings
Thank you guys, this has been educational.  I uploaded everything up to now;
the server was restarted after adding the extra memory, so
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMTIvNi8tLXNvbHJfZ2MubG9nLjAuY3VycmVudC0tMTQtMjEtMTA==WEB

is what I'm looking at.  Tuning the JVM is new to me, so I'm just going by
what I've researched and what this site is saying.
From what I can tell:
  the peak looks like 31GB would be perfect, will implement that today
  throughput seems good, assuming gceasy's recommendation of above 95% is
the target and I'm at 99.6
  latency looks like it's as good as I really care to get, who really cares
about 200ms
  as far as heap after a GC it looks like it recovered well? or am I
missing something?  the red spikes of a full GC hit like 28GB, and right after
it's down to 14GB

I really appreciate this input, it's educational/helpful
-Dave





On Fri, Dec 6, 2019 at 7:48 AM Erick Erickson 
wrote:

> A replication shouldn’t have consumed that much heap. It’s mostly I/O,
> just a write through. If replication really consumes huge amounts of heap
> we need to look at that more closely. Personally I suspect/hope it’s
> coincidental, but that’s only a guess. You can attach jconsole to the
> running process and monitor heap usage in real-time, jconsole is part of
> the JDK so should be relatively easy to install. It has a nifty “gc now”
> button that you can use to see if the heap you’re accumulating is just
> garbage or really accumulates…
>
> And if this really is related to replication and that much heap is
> actually used, we need to figure out why. Shawn’s observation that there is
> very little heap recovered is worrying.
>
> Best,
> Erick
>
> > On Dec 6, 2019, at 7:37 AM, Dave  wrote:
> >
> > Actually at about that time the replication finished and added about
> 20-30gb to the index from the master.  My current set up goes
> > Indexing master -> indexer slave/production master (only replicated on
> command)-> three search slaves (replicate each 15 minutes)
> >
> > We added about 2.3m docs, then I replicated it to the production master
> and since there was a change it replicated out to the slave node the gc
> came from
> >
> > I’ll set one of the slaves to 31/31 and force all load to that one and
> see how she does. Thanks!
> >
> >
> >> On Dec 6, 2019, at 1:02 AM, Shawn Heisey  wrote:
> >>
> >> On 12/5/2019 12:57 PM, David Hastings wrote:
> >>> That probably isnt enough data, so if youre interested:
> >>> https://gofile.io/?c=rZQ2y4
> >>
> >> The previous one was less than 4 minutes, so it doesn't reveal anything
> useful.
> >>
> >> This one is a little bit less than two hours.  That's more useful, but
> still pretty short.
> >>
> >> Here's the "heap after GC" graph from the larger file:
> >>
> >>
> https://www.dropbox.com/s/q9hs8fl0gfkfqi1/david.hastings.gc.graph.2019.12.png?dl=0
> >>
> >> At around 14:15, the heap usage was rather high. It got up over 25GB.
> There were some very long GCs right at that time, which probably means they
> were full GCs.  And they didn't free up any significant amount of memory.
> So I'm betting that sometimes you actually *do* need a big chunk of that
> 60GB of heap.  You might try reducing it to 31g instead of 6m.  Java's
> memory usage is a lot more efficient if the max heap size is less than 32
> GB.
> >>
> >> I can't give you any information about what happened at that time which
> required so much heap.  You could see if you have logfiles that cover that
> timeframe.
> >>
> >> Thanks,
> >> Shawn
>
>


Re: xms/xmx choices

2019-12-05 Thread David Hastings
and in case this may be of use:
https://imgur.com/a/qXBuSxG

I've just been more or less winging the options since Solr 1.3.


On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:

> On 12/5/2019 11:58 AM, David Hastings wrote:
> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to
> be
> > the same so thats the change #1 on my end, I am just concerned of
> dropping
> > it from 60 as thus far over the last few years I have had no problems nor
> > performance issues.  I know its said a lot of times to make it lower and
> > let the OS use the ram for caching the file system/index files, so my
> first
> > experiment was going to be around 20gb, was wondering if this seems
> sound,
> > or should i go even lower?
>
> The Xms and Xmx settings should be the same so Java doesn't need to take
> special action to increase the pool size when more than the minimum is
> required.  Java tends to always increase to the maximum as it runs, so
> there's usually little benefit to specifying a lower minimum than the
> maximum.  With a 60GB max heap, Java is likely to grab a little more
> than 60GB from the OS, regardless of how much heap is actually in use.
>
> If you can provide GC logs from Solr that cover a signficant timeframe,
> especially heavy indexing, we can analyze those and make an estimate
> about the values you should have for Xms and Xmx.  It will only be a
> guess ... something might happen later that requires more heap.
>
> We can't make recommendations without hard data.  The information you
> provided isn't enough to guess how much heap you'll need.  Depending on
> how such a system is used, a few GB might be enough, or you might need a
> lot more.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Thanks,
> Shawn
>


Re: xms/xmx choices

2019-12-05 Thread David Hastings
That probably isn't enough data, so if you're interested:

https://gofile.io/?c=rZQ2y4

On Thu, Dec 5, 2019 at 2:52 PM David Hastings 
wrote:

> I know theres no hard answer, and I know the Xms and Xmx should be the
> same, but it was a set it and forget it sort of thing from years ago.  I
> will definitely be changing it but figured I may as well figure out as
> much as possible from this user group resource.
> as far as the raw GC data goes:
> https://pastebin.com/vBtpYR1W
>
> (i dont know if people still use pastebin)  i can get more if needed.  the
> systems dont do ANY indexing at all, they are search only slaves.  they
> share resources only with a DB install, and one node will never do both
> live search and live DB.  If theres any more info youd like I would be
> happy to provide, this is interesting.
>
> On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:
>
>> On 12/5/2019 11:58 AM, David Hastings wrote:
>> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
>> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed
>> to be
>> > the same so thats the change #1 on my end, I am just concerned of
>> dropping
>> > it from 60 as thus far over the last few years I have had no problems
>> nor
>> > performance issues.  I know its said a lot of times to make it lower and
>> > let the OS use the ram for caching the file system/index files, so my
>> first
>> > experiment was going to be around 20gb, was wondering if this seems
>> sound,
>> > or should i go even lower?
>>
>> The Xms and Xmx settings should be the same so Java doesn't need to take
>> special action to increase the pool size when more than the minimum is
>> required.  Java tends to always increase to the maximum as it runs, so
>> there's usually little benefit to specifying a lower minimum than the
>> maximum.  With a 60GB max heap, Java is likely to grab a little more
>> than 60GB from the OS, regardless of how much heap is actually in use.
>>
>> If you can provide GC logs from Solr that cover a significant timeframe,
>> especially heavy indexing, we can analyze those and make an estimate
>> about the values you should have for Xms and Xmx.  It will only be a
>> guess ... something might happen later that requires more heap.
>>
>> We can't make recommendations without hard data.  The information you
>> provided isn't enough to guess how much heap you'll need.  Depending on
>> how such a system is used, a few GB might be enough, or you might need a
>> lot more.
>>
>>
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Thanks,
>> Shawn
>>
>


Re: xms/xmx choices

2019-12-05 Thread David Hastings
I know there's no hard answer, and I know the Xms and Xmx should be the
same, but it was a set-it-and-forget-it sort of thing from years ago.  I
will definitely be changing it, but figured I may as well learn as much as
possible from this user group resource.
as far as the raw GC data goes:
https://pastebin.com/vBtpYR1W

(I don't know if people still use pastebin.)  I can get more if needed.  The
systems don't do ANY indexing at all; they are search-only slaves.  They
share resources only with a DB install, and one node will never do both
live search and live DB.  If there's any more info you'd like I would be
happy to provide it; this is interesting.

On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:

> On 12/5/2019 11:58 AM, David Hastings wrote:
> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to
> be
> > the same so thats the change #1 on my end, I am just concerned of
> dropping
> > it from 60 as thus far over the last few years I have had no problems nor
> > performance issues.  I know its said a lot of times to make it lower and
> > let the OS use the ram for caching the file system/index files, so my
> first
> > experiment was going to be around 20gb, was wondering if this seems
> sound,
> > or should i go even lower?
>
> The Xms and Xmx settings should be the same so Java doesn't need to take
> special action to increase the pool size when more than the minimum is
> required.  Java tends to always increase to the maximum as it runs, so
> there's usually little benefit to specifying a lower minimum than the
> maximum.  With a 60GB max heap, Java is likely to grab a little more
> than 60GB from the OS, regardless of how much heap is actually in use.
>
> If you can provide GC logs from Solr that cover a significant timeframe,
> especially heavy indexing, we can analyze those and make an estimate
> about the values you should have for Xms and Xmx.  It will only be a
> guess ... something might happen later that requires more heap.
>
> We can't make recommendations without hard data.  The information you
> provided isn't enough to guess how much heap you'll need.  Depending on
> how such a system is used, a few GB might be enough, or you might need a
> lot more.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Thanks,
> Shawn
>


Re: From solr to solr cloud

2019-12-05 Thread David Hastings
are you noticing performance decreases in standalone Solr as of now?

On Thu, Dec 5, 2019 at 2:29 PM Vignan Malyala  wrote:

> Hi
> I currently have 500 collections in my stand alone solr. Bcoz of day by day
> increase in Data, I want to convert it into solr cloud.
> Can you suggest me how to do it successfully.
> How many shards should be there?
> How many nodes should be there?
> Are so called nodes different machines i should take?
> How many zoo keeper nodes should be there?
> Are so called zoo keeper nodes different machines i should take?
> Total how many machines i have to take to implement scalable solr cloud?
>
> Plz detail these questions. Any of documents on web aren't clear for
> production environments.
> Thanks in advance.
>


xms/xmx choices

2019-12-05 Thread David Hastings
Hey all, over time I've adjusted and changed the Solr Xms/Xmx various times
with not much thought beyond "more is better", but I've noticed in
many of the emails that the recommended values are much lower than the numbers
I've historically put in.  I never really bothered to change them, as the
performance was always more than acceptable.  But now we just got a
memory upgrade on our Solr nodes, so I figure I may as well do it right.

so I'm sitting at around
580GB core
150GB core
270GB core
300GB core
depending on merges etc., with around 50k-100k searches a day depending on
the time of year/school calendar.
The three live nodes each have 4TB of decent SSDs that hold the indexes,
and we just went from 148GB to 288GB of memory.
as of now we do an xms of 8gb and xmx of 60gb, generally through the
dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to be
the same so thats the change #1 on my end, I am just concerned of dropping
it from 60 as thus far over the last few years I have had no problems nor
performance issues.  I know its said a lot of times to make it lower and
let the OS use the ram for caching the file system/index files, so my first
experiment was going to be around 20gb, was wondering if this seems sound,
or should i go even lower?

Thanks, always good learning with this email group.
-Dave


Re: Exact match

2019-12-02 Thread David Hastings
if the query is in quotes it will work.  also, not sure if you've been
following, but get rid of:
StopFilterFactory and all stopwords, or just make your stopword file empty.
if you need it to work without the user typing quotes, add the quotes to the
query after submission?
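for what it's worth, a rough SolrJ sketch of the quoted-phrase idea (the solr
url, core name and field name below are placeholders, not anything from your
setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExactPhraseExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // quoting the whole phrase makes the parser treat it as one unit
            // instead of OR-ing the individual words together
            SolrQuery q = new SolrQuery("text_field:\"united states of america\"");
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getResults().getNumFound() + " phrase matches");
        }
    }
}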

On Mon, Dec 2, 2019 at 3:44 PM OTH  wrote:

> Hello,
>
> What would be the best way to get exact matches (if any) to a query?
>
> E.g.:  Let's the document text is:  "united states of america".
> Currently, any query containing one or more of the three words "united",
> "states", or "america" will match with the above document.  I would like a
> way so that the document matches only and only if the query were also
> "united states of america" (case-insensitive).
>
> Document field type:  TextField
> Index Analyzer: TokenizerChain
> Index Tokenizer: StandardTokenizerFactory
> Index Token Filters: StopFilterFactory, LowerCaseFilterFactory,
> SnowballPorterFilterFactory
> The Query Analyzer / Tokenizer / Token Filters are the same as the Index
> ones above.
>
> FYI I'm relatively novice at Solr / Lucene / Search.
>
> Much appreciated
> Omer
>


Re: A Last Message to the Solr Users

2019-11-27 Thread David Hastings
Personally I found nothing in SolrCloud worth changing from standalone
for; it just added more complications and more servers, and required becoming
an expert in ZooKeeper. I'd rather spend my time developing
than becoming a systems administrator.

On Wed, Nov 27, 2019 at 3:45 AM Mark Miller  wrote:

> This is your queue to come and make your jokes with your name attached. I’m
> sure the Solr users will appreciate them more than I do. I can’t laugh at
> this situation because I take production code seriously.
>
> --
> - Mark
>
> http://about.me/markrmiller
>


Re: Using an & in an indexed field and then querying for it.

2019-11-25 Thread David Hastings
This again is why it's recommended to have something in between your JS and
Solr. Also, GET requests have length limitations, so if your queries eventually
get more complicated you may hit them:

https://stackoverflow.com/questions/2659952/maximum-length-of-http-get-request

On Mon, Nov 25, 2019 at 2:40 PM rhys J  wrote:

> On Mon, Nov 25, 2019 at 2:36 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > its breaking on the & because its in the url and you are most likely
> > sending a get request to solr.  you should send it as post or as %26
> >
> >
> The package I am using doesn't have a postJSON function available, so I'm
> using their getJSON function.
>
> I changed the & to %26, and that fixed things.
>
> Thanks,
>
> Rhys
>


Re: Using an & in an indexed field and then querying for it.

2019-11-25 Thread David Hastings
it's breaking on the & because it's in the URL and you are most likely
sending a GET request to solr.  you should send it as a POST, or encode the & as %26
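here's a tiny java sketch of the %26 idea, i.e. URL-encoding the user's text
before it goes into the GET request (the value below is just an example):

import java.net.URLEncoder;

public class EncodeExample {
    public static void main(String[] args) throws Exception {
        String raw = "Reliable Van & Storage";
        // & becomes %26 and spaces become +, so the query string no longer breaks
        String encoded = URLEncoder.encode(raw, "UTF-8");
        System.out.println(encoded);  // Reliable+Van+%26+Storage
    }
}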

On Mon, Nov 25, 2019 at 2:32 PM rhys J  wrote:

> I have some fields that have text like so:
>
> Reliable Van & Storage.
>
> They indexed fine when I used curl and csv files to read them into the
> core.
>
> Now when I try to query for them, I get errors.
>
> If I try escaping it like so \&, I get the following error:
>
> on_data({
>   "responseHeader":{
> "status":400,
> "QTime":1,
> "params":{
>   "q":"name1:( reliable van \\",
>   "core":"dbtr",
>   "json.wrf":"on_data",
>   "hl":"true",
>   "indent":"on",
>   "start":"0",
>   "stor )":"",
>   "callback":"?",
>   "rows":"50",
>   "version":"2.2",
>   "wt":"json"}},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.parser.TokenMgrError"],
> "msg":"org.apache.solr.search.SyntaxError: Cannot parse 'name1:(
> reliable van \\': Lexical error at line 1, column 23.  Encountered:
>  after : \"\"",
> "code":400}})
>
> If I try html encoding it like so:  I get the following error:
>
>
>
> on_data({
>   "responseHeader":{
> "status":400,
> "QTime":3,
> "params":{
>   "q":"name1:( reliable van ",
>   "core":"dbtr",
>   "json.wrf":"on_data",
>   "hl":"true",
>   "indent":"on",
>   "amp; stor )":"",
>   "start":"0",
>   "callback":"?",
>   "rows":"50",
>   "version":"2.2",
>   "wt":"json"}},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.parser.ParseException"],
> "msg":"org.apache.solr.search.SyntaxError: Cannot parse 'name1:(
> reliable van ': Encountered \"\" at line 1, column 21.\nWas
> expecting one of:\n ...\n ...\n ...\n
> \"+\" ...\n\"-\" ...\n ...\n\"(\" ...\n\")\"
> ...\n\"*\" ...\n ...\n ...\n
> ...\n ...\n ...\n\"[\" ...\n
> \"{\" ...\n ...\n\"filter(\" ...\n ...\n
>   ",
> "code":400}})
>
>
> How can I search for a field that has an & without breaking the
> parser, or is it not possible because & is used as a special
> character?
>
> Thanks,
>
> Rhys
>


Re: How to tell which core was used based on Json or XML response from Solr

2019-11-25 Thread David Hastings
you missed the part about adding echoParams to the query:
&echoParams=all&core=mega

returns for me:

 "responseHeader":{
"status":0,
"QTime":0,
"params":{
  "q":"*:*",
  "core":"mega",
  "df":"text",
  "q.op":"AND",
  "rows":"10",
  "echoParams":"all"}},

also we are a perl shop as well, you could implement something as
simple as this in a cgi script or something:


use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $url = $searcher;                               # your solr select URL
my $agent = new LWP::UserAgent;
my $request = POST($url, $data);                   # $data holds the query params
my $response = $agent->request($request)->decoded_content;




On Mon, Nov 25, 2019 at 10:37 AM rhys J  wrote:

> On Mon, Nov 25, 2019 at 1:10 AM Paras Lehana 
> wrote:
>
> > Hey rhys,
> >
> > What David suggested is what we do for querying Solr. You can figure out
> > our frontend implementation of Auto-Suggest by seeing the AJAX requests
> > fired when you type in the search box on www.indiamart.com.
> >
>
>  That is pretty cool.
>
> I've ended up with something that highlights the match in a results table.
> It's working, and the client seems happy with that implementation for now.
>
>
> > Why are you using two jQuery files? If you have a web server, you already
> > know that which core you queried from. Just convert the Solr JSON
> response
> > and add the key "core" and return the modified JSON response. Keep your
> > front-end query simple - just describe your query. All the other
> parameters
> >
>
> We are using 2 jquery versions, because this tool is running a tool that
> has an old version of jquery attached to it. Because of that, I'm doing the
> trick where you can load 2 different versions at the same time.
>
>
> > can be added on the web server side. Anyways, why do you want to know the
> > core name?
> >
>
> I need to know the core name, because each core has different values in the
> documents, and I want to display those values based on which core was
> queried.
>
> This is kind of like an omnibox, where the user will just start typing
> stuff into it. Based on what is typed, I will search a different core to
> provide the right answer to them.
>
> Thanks,
>
> Rhys
>


Re: How to tell which core was used based on Json or XML response from Solr

2019-11-22 Thread David Hastings
i personally don't like PHP, but it may just be the easiest way to do what
you need, assuming you have a basic web server.
send your search query to PHP, and use $_GET or $_POST to read it into a
variable:
https://www.php.net/manual/en/reserved.variables.get.php

then send that to the solr server in the same piece of php with curl

https://phpenthusiast.com/blog/five-php-curl-examples

and return the raw result if you want.  at the very least it hides the solr
url, and with this you can block the solr port to outside IPs and only allow 80
or whatever port your webserver is using.
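if php isn't your thing, the same middle-layer idea sketched with SolrJ
instead of curl looks roughly like this (the solr url and core name are
placeholders, and the web-facing piece that would call it is left out):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SearchProxy {
    // keep the real solr location server-side so the browser never sees it
    private static final String SOLR_URL = "http://localhost:8983/solr/mycore";

    public static String search(String userQuery) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(SOLR_URL).build()) {
            SolrQuery q = new SolrQuery(userQuery);
            q.setRows(50);
            // hand the raw result back to whatever renders the page
            return client.query(q).getResults().toString();
        }
    }
}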


On Fri, Nov 22, 2019 at 1:43 PM rhys J  wrote:

> On Fri, Nov 22, 2019 at 1:39 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > 2 things (maybe 3):
> > 1.  dont have this code facing a client thats not you, otherwise anyone
> > could view the source and see where the solr server is, which means they
> > can destroy your index or anything they want.  put at the very least a
> > simple api/front end in between the javascript page for the user and the
> > solr server
> >
>
> Is there a way I can fix this?
>
>
> > 2. i dont think there is a way, you would be better off indexing an
> > indicator of sorts into your documents
> >
>
> Oh this is a good idea.
>
> Thanks!
>
> 3. the jquery in your example already has the core identified, not sure why
> > the receiving javascript wouldn't be able to read that variable unless im
> > missing something.
> >
> >
> There's another function on_data that is being called by the url, which
> does not have any indication of what the core was, only the response from
> the url.
>
> Thanks,
>
> Rhys
>


Re: How to tell which core was used based on Json or XML response from Solr

2019-11-22 Thread David Hastings
2 things (maybe 3):
1.  dont have this code facing a client thats not you, otherwise anyone
could view the source and see where the solr server is, which means they
can destroy your index or anything they want.  put at the very least a
simple api/front end in between the javascript page for the user and the
solr server
2. i don't think there is a way; you would be better off indexing an
indicator of sorts into your documents (rough sketch below)
3. the jquery in your example already has the core identified, not sure why
the receiving javascript wouldn't be able to read that variable unless im
missing something.
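for point 2, a rough SolrJ sketch of what indexing an indicator could look
like; the "source_core" field name is made up for illustration, and the other
values are just examples:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithIndicator {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/dbtr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "123");
            doc.addField("name1", "some debtor name");
            doc.addField("source_core", "dbtr");  // the indicator your javascript can read back
            client.add(doc);
            client.commit();
        }
    }
}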

On Fri, Nov 22, 2019 at 1:27 PM rhys J  wrote:

> I'm implementing an autocomplete search box for Solr.
>
> I'm using JSON as my response style, and this is the jquery code.
>
>
>  var url='http://10.40.10.14:8983/solr/'+core+'/select/?q='+queryField +
>
>
> > query+'&version=2.2&hl=true&start=0&rows=50&indent=on&wt=json&callback=?&json.wrf=on_data';
>
>  jQuery_3_4_1.getJSON(url);
>
> ___
>
> on_data(data)
> {
>  var docs = data.response.docs;
> jQuery_3_4_1.each(docs, function(i, item) {
>
> > var trLink = '<tr><td><a href="#" onclick="local_goto_dbtr(' + item.debtor_id + '); return true;"> '
> >  + item.debtor_id + '</a></td>';
>
> > trLink += '<td>' + item.name1 + '</td>';
> > trLink += '<td>' + item.dl1 + '</td>';
> > trLink += '</tr>';
>
> jQuery_3_4_1('#resultsTable').prepend(jQuery_3_4_1(trLink));
> });
>
> }
>
> the jQuery_3_4_1 variable is replacing $ because I needed to have 2
> different versions of jQuery running in the same document.
>
> I'd like to know if there's something I'm missing that will indicate which
> core I've used in Solr based on the response.
>
> Thanks,
>
> Rhys
>


Re: Highlighting on typing in search box

2019-11-21 Thread David Hastings
you can modify the result in this SO question to fit your needs:

https://stackoverflow.com/questions/16742610/retrieve-results-from-solr-using-jquery-calls

On Thu, Nov 21, 2019 at 10:42 AM rhys J  wrote:

> Are there any recommended APIs or code examples of using Solr and then
> highlighting results below the search box?
>
> I'm trying to implement a search box that will search solr as the user
> types, if that makes sense?
>
> Thanks,
>
> Rhys
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
I use 3-word shingles with stopwords for my MLT ML trainer; that worked
pretty well for such a solution, but for a full index the size became
prohibitive

On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
wrote:

> If we had IDF for phrases, they would be super effective. The 2X weight is
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 8, 2019, at 11:08 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > the pf and qf fields are REALLY nice for this
> >
> > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood 
> > wrote:
> >
> >> I always enable phrase searching in edismax for exactly this reason.
> >>
> >> Something like:
> >>
> >>   title^16 keywords^8 text^2
> >>
> >> To deal with concepts in queries, a classifier and/or named entity
> >> extractor can be helpful. If you have a list of concepts (“controlled
> >> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> >> term can be queried against the field matching that vocabulary.
> >>
> >> This is how LinkedIn separates people, companies, and places, for
> example.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson 
> >> wrote:
> >>>
> >>> Look at the “mm” parameter, try setting it to 100%. Although that’t not
> >> entirely likely to do what you want either since virtually every doc
> will
> >> have “a” in it. But at least you’d get docs that have both terms.
> >>>
> >>> you may also be able to search for things like “Lamin A” _only as a
> >> phrase_ and have some luck. But this is a gnarly problem in general.
> Some
> >> people have been able to substitute synonyms and/or shingles to make
> this
> >> work at the expense of a larger index.
> >>>
> >>> This is a generic problem with context. “Lamin A” is really a
> “concept”,
> >> not just two words that happen to be near each other. Searching as a
> phrase
> >> is an OOB-but-naive way to try to make it more likely that the ranked
> >> results refer to the _concept_ of “Lamin A”. The assumption here is “if
> >> these two words appear next to each other, they’re more likely to be
> what I
> >> want”. I say “naive” because “Lamins: A new approach to...” would
> _also_ be
> >> found for a naive phrase search. (I have no idea whether such a title
> makes
> >> sense or not, but you figured that out already)...
> >>>
> >>> To do this well you’d have to dive in to NLP/Machine learning.
> >>>
> >>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
> >>>
> >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri 
> >> wrote:
> >>>>
> >>>> HI Walter and Paras
> >>>>
> >>>> I indexed it removing all the references to StopWordFilter and I went
> >> from 121 results to near 20K as the search term q="Lymphoid and a
> >> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
> So I
> >> don't think removing it completely is the way to go from the scenario we
> >> have, but I appreciate the suggestion…
> >>>>
> >>>> Yes the response is using fl=*
> >>>> I am trying some combinations at the moment, but yet no success.
> >>>>
> >>>> defType=edismax
> >>>> q.alt=Lymphoid and a non-Lymphoid cell
> >>>> Number of results=1599
> >>>> Quite a considerable increase, even though reasonable meaningful
> >> results.
> >>>>
> >>>> I am sorry but I didn't understand what do you want me to do exactly
> >> with the lst (??) and qf and bf.
> >>>>
> >>>> Thanks everyone with their inputs
> >>>>
> >>>>
> >>>>> On 8 Nov 2019, at 06:45, Paras Lehana 
> >> wrote:
> >>>>>
> >>>>> Hi Guilherme
> >>>>>
> >>>>> By accident, I ended up querying the using the default handler
> >> (/select) and it worked.
> >>>>>
> >>>>> You've just found the culprit. Thanks for 

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
he parsed query.
> Ignore all the
> >>>>>>>> relevance calculations for the nonce, or specify “=query”
> to skip
> >>>>>>>> that part.
> >>>>>>>>> The two json files i've sent, they are debugQuery=on and the
> explain tag
> >>>>>>>> is present.
> >>>>>>>>> I will try the searching the way you mentioned.
> >>>>>>>>>
> >>>>>>>>> Thank for your inputs
> >>>>>>>>>
> >>>>>>>>> Guilherme
> >>>>>>>>>
> >>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Fwd to another server
> >>>>>>>>>>
> >>>>>>>>>> First, your index and analysis chains are considerably
> different, this
> >>>>>>>> can easily be a source of problems. In particular, using two
> different
> >>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> this unless
> >>>>>>>> you’re totally sure you understand the consequences.
> Additionally, your use
> >>>>>>>> of the length filter is suspicious, especially since your problem
> statement
> >>>>>>>> is about the addition of a single letter term and the min length
> allowed on
> >>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> ’a’ is
> >>>>>>>> filtered out in both cases, but maybe you’ve found something odd
> about the
> >>>>>>>> interactions.
> >>>>>>>>>>
> >>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> typos?
> >>>>>>>> Used by custom code?
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
> >>>>>>>>>>
> >>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
> that
> >>>>>>>> all the params with an equal-sign are totally ignored unless it’s
> just a
> >>>>>>>> typo.
> >>>>>>>>>>
> >>>>>>>>>> Third, the easiest way to see what’s happening under the covers
> is to
> >>>>>>>> add “=true” to the query and look at the parsed query.
> Ignore all the
> >>>>>>>> relevance calculations for the nonce, or specify “=query”
> to skip
> >>>>>>>> that part.
> >>>>>>>>>>
> >>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
> >>>>>>>> expect” is answered by looking at the “=query” output and
> the
> >>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
> sure to look
> >>>>>>>> at _both_ the query and index output. Also, and very important
> about the
> >>>>>>>> analysis page (and this is confusing) is that this _assumes_ that
> what you
> >>>>>>>> put in the text boxes have made it through the query parser
> intact and is
> >>>>>>>> analyzed by the field selected. Consider the search
> "q=field:word1 word2".
> >>>>>>>> Now you type “word1 word2” into the analysis text box and it
> looks like
> >>>>>>>> what you expect. That’s misleading because the query is _parsed_
> as
> >>>>>>>> "field:word1 default_search_field:word2”. This is where
> “=query”
> >>>>>>>> helps.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Erick
> >>>>>>>>>>
> >>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> paras.le

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
erstand the consequences. Additionally,
> your use
> >>>>>> of the length filter is suspicious, especially since your problem
> statement
> >>>>>> is about the addition of a single letter term and the min length
> allowed on
> >>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> ’a’ is
> >>>>>> filtered out in both cases, but maybe you’ve found something odd
> about the
> >>>>>> interactions.
> >>>>>>> I will investigate the min length and post the results later.
> >>>>>>>
> >>>>>>>> Second, I have no idea what this will do. Are the equal signs
> typos?
> >>>>>> Used by custom code?
> >>>>>>> This the url in my application, not solr params. That's the query
> string.
> >>>>>>>
> >>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
> that
> >>>>>> all the params with an equal-sign are totally ignored unless it’s
> just a
> >>>>>> typo.
> >>>>>>> This is part of the application. Species will be used later on in
> solr
> >>>>>> to filter out the result. That's not solr. That my app params.
> >>>>>>>
> >>>>>>>> Third, the easiest way to see what’s happening under the covers
> is to
> >>>>>> add “=true” to the query and look at the parsed query. Ignore
> all the
> >>>>>> relevance calculations for the nonce, or specify “=query” to
> skip
> >>>>>> that part.
> >>>>>>> The two json files i've sent, they are debugQuery=on and the
> explain tag
> >>>>>> is present.
> >>>>>>> I will try the searching the way you mentioned.
> >>>>>>>
> >>>>>>> Thank for your inputs
> >>>>>>>
> >>>>>>> Guilherme
> >>>>>>>
> >>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson  <mailto:erickerick...@gmail.com>>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Fwd to another server
> >>>>>>>>
> >>>>>>>> First, your index and analysis chains are considerably different,
> this
> >>>>>> can easily be a source of problems. In particular, using two
> different
> >>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>>>>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>>>>> of the length filter is suspicious, especially since your problem
> statement
> >>>>>> is about the addition of a single letter term and the min length
> allowed on
> >>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> ’a’ is
> >>>>>> filtered out in both cases, but maybe you’ve found something odd
> about the
> >>>>>> interactions.
> >>>>>>>>
> >>>>>>>> Second, I have no idea what this will do. Are the equal signs
> typos?
> >>>>>> Used by custom code?
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
> >>>>>>>>
> >>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
> that
> >>>>>> all the params with an equal-sign are totally ignored unless it’s
> just a
> >>>>>> typo.
> >>>>>>>>
> >>>>>>>> Third, the easiest way to see what’s happening under the covers
> is to
> >>>>>> add “=true” to the query and look at the parsed query. Ignore
> all the
> >>>>>> relevance calculations for the nonce, or specify “=query” to
> skip
> >>>>>> that part.
> >>>>>>>>
> >>>>>>>> 90% + of the time, the question “why didn’t this query do what I
> >>>>>> expect” is answered by looking at the “=query” output and the

Re: Good Open Source Front End for Solr

2019-11-07 Thread David Hastings
well thats pretty slick

On Thu, Nov 7, 2019 at 1:59 PM Erik Hatcher  wrote:

> Blacklight: http://projectblacklight.org/ 
>
> ;)
>
>
>
> > On Nov 6, 2019, at 11:16 PM, Java Developer 
> wrote:
> >
> > Hi,
> >
> > What is the best open source front-end for Solr
> >
> > Thanks
>
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread David Hastings
lts later.
> >>>>
> >>>>> Second, I have no idea what this will do. Are the equal signs typos?
> >>> Used by custom code?
> >>>> This the url in my application, not solr params. That's the query
> string.
> >>>>
> >>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> >>> all the params with an equal-sign are totally ignored unless it’s just
> a
> >>> typo.
> >>>> This is part of the application. Species will be used later on in solr
> >>> to filter out the result. That's not solr. That my app params.
> >>>>
> >>>>> Third, the easiest way to see what’s happening under the covers is to
> >>> add “=true” to the query and look at the parsed query. Ignore
> all the
> >>> relevance calculations for the nonce, or specify “=query” to skip
> >>> that part.
> >>>> The two json files i've sent, they are debugQuery=on and the explain
> tag
> >>> is present.
> >>>> I will try the searching the way you mentioned.
> >>>>
> >>>> Thank for your inputs
> >>>>
> >>>> Guilherme
> >>>>
> >>>>> On 6 Nov 2019, at 14:14, Erick Erickson 
> >>> wrote:
> >>>>>
> >>>>> Fwd to another server
> >>>>>
> >>>>> First, your index and analysis chains are considerably different,
> this
> >>> can easily be a source of problems. In particular, using two different
> >>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>> of the length filter is suspicious, especially since your problem
> statement
> >>> is about the addition of a single letter term and the min length
> allowed on
> >>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> >>> filtered out in both cases, but maybe you’ve found something odd about
> the
> >>> interactions.
> >>>>>
> >>>>> Second, I have no idea what this will do. Are the equal signs typos?
> >>> Used by custom code?
> >>>>>
> >>>>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>>>
> >>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> >>> all the params with an equal-sign are totally ignored unless it’s just
> a
> >>> typo.
> >>>>>
> >>>>> Third, the easiest way to see what’s happening under the covers is to
> >>> add “=true” to the query and look at the parsed query. Ignore
> all the
> >>> relevance calculations for the nonce, or specify “=query” to skip
> >>> that part.
> >>>>>
> >>>>> 90% + of the time, the question “why didn’t this query do what I
> >>> expect” is answered by looking at the “=query” output and the
> >>> analysis page in the admin UI. NOTE: for the analysis page be sure to
> look
> >>> at _both_ the query and index output. Also, and very important about
> the
> >>> analysis page (and this is confusing) is that this _assumes_ that what
> you
> >>> put in the text boxes have made it through the query parser intact and
> is
> >>> analyzed by the field selected. Consider the search "q=field:word1
> word2".
> >>> Now you type “word1 word2” into the analysis text box and it looks like
> >>> what you expect. That’s misleading because the query is _parsed_ as
> >>> "field:word1 default_search_field:word2”. This is where “=query”
> >>> helps.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> paras.leh...@indiamart.com>
> >>> wrote:
> >>>>>>
> >>>>>> Hi Walter,
> >>>>>>
> >>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
> words
> >>> will
> >>>>>>> not be in the index, so they can never match a query.
> >>>>>>
> >>>>>>
> >>>>>> I think the OP's concern is different results when adding a
> stopword. I

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread David Hastings
no, the

   <filter class="solr.StopFilterFactory" words="stopwords.txt"/>

line is still using stopwords and should be removed, in my opinion of course;
based on your use case it may be different, but i generally axe any reference
to them at all

On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri  wrote:

> Thanks.
> Haven't I done this here ?
>positionIncrementGap="100" omitNorms="false" >
>
>
>
>
>
> words="stopwords.txt"/>
>
>
>
> > On 5 Nov 2019, at 14:15, David Hastings 
> wrote:
> >
> > Fwd to another server
> >
> > The first thing you should do is remove any reference to stop words and
> > never use them, then re-index your data and try it again.
> >
> > On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri 
> wrote:
> >
> >> Hi,
> >>
> >> I am performing a search to match a name (text_field), however this term
> >> contains 'and' and 'a' and it doesn't return any records. If i remove
> 'a'
> >> then it works.
> >> e.g
> >> Search Term: lymphoid and a non-lymphoid cell
> >> doesn't work:
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >> <
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>
> >>
> >> Search term: lymphoid and non-lymphoid cell
> >> works:
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >> <
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>
> >> interested in the first result
> >>
> >> schema.xml
> >>  >> indexed="true"  stored="true"   omitNorms="false"   required="true"
> >> multiValued="false"/>
> >>
> >>
> >> >> pattern="[^a-zA-Z0-9/._:]"/>
> >> >> pattern="^[/._:]+" replacement=""/>
> >> >> pattern="[/._:]+$" replacement=""/>
> >> >> pattern="[_]" replacement=" "/>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >> >> positionIncrementGap="100" omitNorms="false" >
> >>
> >>
> >>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >> >> pattern="[^a-zA-Z0-9/._:]"/>
> >> >> pattern="^[/._:]+" replacement=""/>
> >> >> pattern="[/._:]+$" replacement=""/>
> >> >> pattern="[_]" replacement=" "/>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >>
> >> stopwords.txt
> >> #Standard english stop words taken from Lucene's StopAnalyzer
> >> a
> >> b
> >> c
> >> 
> >> an
> >> and
> >> are
> >>
> >> Running SolR 6.6.2.
> >>
> >> Is there anything I could do to prevent this ?
> >>
> >> Thanks
> >> Guilherme
>
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread David Hastings
The first thing you should do is remove any reference to stop words and
never use them, then re-index your data and try it again.

On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri  wrote:

> Hi,
>
> I am performing a search to match a name (text_field), however this term
> contains 'and' and 'a' and it doesn't return any records. If i remove 'a'
> then it works.
> e.g
> Search Term: lymphoid and a non-lymphoid cell
> doesn't work:
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
>
> Search term: lymphoid and non-lymphoid cell
> works:
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
> interested in the first result
>
> schema.xml
>indexed="true"  stored="true"   omitNorms="false"   required="true"
>  multiValued="false"/>
>
> 
>  pattern="[^a-zA-Z0-9/._:]"/>
>  pattern="^[/._:]+" replacement=""/>
>  pattern="[/._:]+$" replacement=""/>
>  pattern="[_]" replacement=" "/>
> 
> 
>  words="stopwords.txt"/>
> 
>
>  positionIncrementGap="100" omitNorms="false" >
> 
> 
> 
> 
> 
>  words="stopwords.txt"/>
> 
> 
>  pattern="[^a-zA-Z0-9/._:]"/>
>  pattern="^[/._:]+" replacement=""/>
>  pattern="[/._:]+$" replacement=""/>
>  pattern="[_]" replacement=" "/>
> 
> 
>  words="stopwords.txt"/>
> 
> 
>
> stopwords.txt
> #Standard english stop words taken from Lucene's StopAnalyzer
> a
> b
> c
> 
> an
> and
> are
>
> Running SolR 6.6.2.
>
> Is there anything I could do to prevent this ?
>
> Thanks
> Guilherme


Re: Delete documents from the Solr index using SolrJ

2019-11-04 Thread David Hastings
delete by query would do the trick, unless I'm missing something
significant in what you're trying to do here. you can just pass in an xml
command:
'<delete><query>' . $kill_query . '</query></delete>'

On Mon, Nov 4, 2019 at 9:37 AM Khare, Kushal (MIND) <
kushal.kh...@mind-infotech.com> wrote:

> In my case, id won't be same.
> Suppose, I have a doc with id : 20
> Now, it's newer version would be either 20.1 or 22
> What in this case?
> -Original Message-
> From: David Hastings [mailto:hastings.recurs...@gmail.com]
> Sent: 04 November 2019 20:04
> To: solr-user@lucene.apache.org
> Subject: Re: Delete documents from the Solr index using SolrJ
>
> when you add a new document using the same "id" value as another it just
> over writes it
>
> On Mon, Nov 4, 2019 at 9:30 AM Khare, Kushal (MIND) <
> kushal.kh...@mind-infotech.com> wrote:
>
> > Could you please let me know how to achieve that ?
> >
> >
> > -Original Message-
> > From: Jörn Franke [mailto:jornfra...@gmail.com]
> > Sent: 04 November 2019 19:59
> > To: solr-user@lucene.apache.org
> > Subject: Re: Delete documents from the Solr index using SolrJ
> >
> > I don’t understand why it is not possible.
> >
> > However why don’t you simply overwrite the existing document instead
> > of
> > add+delete
> >
> > > Am 04.11.2019 um 15:12 schrieb Khare, Kushal (MIND) <
> > kushal.kh...@mind-infotech.com>:
> > >
> > > Hello mates!
> > > I want to know how we can delete the documents from the Solr index .
> > Suppose for my system, I have a document that has been indexed, now
> > its newer version is into use, so I want to use the latest one, for
> > that I want the previous one to be deleted from the index.
> > > Kindly help me a way out !
> > > I went through many articles and blogs, got the way (methods) for
> > deleting , but not actually, how to do it, because it's not possible
> > to delete every time by passing id's in around 50,000 doc system.
> > > Please suggest!
> > >
> > > 
> > >
> > > The information contained in this electronic message and any
> > > attachments
> > to this message are intended for the exclusive use of the addressee(s)
> > and may contain proprietary, confidential or privileged information.
> > If you are not the intended recipient, you should not disseminate,
> > distribute or copy this e-mail. Please notify the sender immediately
> > and destroy all copies of this message and any attachments. WARNING:
> > Computer viruses can be transmitted via email. The recipient should
> > check this email and any attachments for the presence of viruses. The
> > company accepts no liability for any damage caused by any
> > virus/trojan/worms/malicious code transmitted by this email.
> > www.motherson.com
> >
> > 
> >
> > The information contained in this electronic message and any
> > attachments to this message are intended for the exclusive use of the
> > addressee(s) and may contain proprietary, confidential or privileged
> > information. If you are not the intended recipient, you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately and destroy all copies of this message and any
> > attachments. WARNING: Computer viruses can be transmitted via email.
> > The recipient should check this email and any attachments for the
> > presence of viruses. The company accepts no liability for any damage
> > caused by any virus/trojan/worms/malicious code transmitted by this
> > email. www.motherson.com
> >
>
> 
>
> The information contained in this electronic message and any attachments
> to this message are intended for the exclusive use of the addressee(s) and
> may contain proprietary, confidential or privileged information. If you are
> not the intended recipient, you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately and destroy all copies of
> this message and any attachments. WARNING: Computer viruses can be
> transmitted via email. The recipient should check this email and any
> attachments for the presence of viruses. The company accepts no liability
> for any damage caused by any virus/trojan/worms/malicious code transmitted
> by this email. www.motherson.com
>


Re: Delete documents from the Solr index using SolrJ

2019-11-04 Thread David Hastings
when you add a new document using the same "id" value as another it just
over writes it

On Mon, Nov 4, 2019 at 9:30 AM Khare, Kushal (MIND) <
kushal.kh...@mind-infotech.com> wrote:

> Could you please let me know how to achieve that ?
>
>
> -Original Message-
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: 04 November 2019 19:59
> To: solr-user@lucene.apache.org
> Subject: Re: Delete documents from the Solr index using SolrJ
>
> I don’t understand why it is not possible.
>
> However why don’t you simply overwrite the existing document instead of
> add+delete
>
> > Am 04.11.2019 um 15:12 schrieb Khare, Kushal (MIND) <
> kushal.kh...@mind-infotech.com>:
> >
> > Hello mates!
> > I want to know how we can delete the documents from the Solr index .
> Suppose for my system, I have a document that has been indexed, now its
> newer version is into use, so I want to use the latest one, for that I want
> the previous one to be deleted from the index.
> > Kindly help me a way out !
> > I went through many articles and blogs, got the way (methods) for
> deleting , but not actually, how to do it, because it's not possible to
> delete every time by passing id's in around 50,000 doc system.
> > Please suggest!
> >
> > 
> >
> > The information contained in this electronic message and any attachments
> to this message are intended for the exclusive use of the addressee(s) and
> may contain proprietary, confidential or privileged information. If you are
> not the intended recipient, you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately and destroy all copies of
> this message and any attachments. WARNING: Computer viruses can be
> transmitted via email. The recipient should check this email and any
> attachments for the presence of viruses. The company accepts no liability
> for any damage caused by any virus/trojan/worms/malicious code transmitted
> by this email. www.motherson.com
>
> 
>
> The information contained in this electronic message and any attachments
> to this message are intended for the exclusive use of the addressee(s) and
> may contain proprietary, confidential or privileged information. If you are
> not the intended recipient, you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately and destroy all copies of
> this message and any attachments. WARNING: Computer viruses can be
> transmitted via email. The recipient should check this email and any
> attachments for the presence of viruses. The company accepts no liability
> for any damage caused by any virus/trojan/worms/malicious code transmitted
> by this email. www.motherson.com
>


Re: Re: POS Tagger

2019-10-25 Thread David Hastings
oh i see what you mean, sorry, i explained it incorrectly.
 those sentences are what would be in the index, and a general search for
'rush limbaugh' would rank results where he is a tagged entity higher
than results where it is just two words in a sentence

On Fri, Oct 25, 2019 at 12:12 PM David Hastings <
hastings.recurs...@gmail.com> wrote:

> nope, i boost the fields already tagged at query time against teh query
>
> On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> So then you do run your POS tagger at query-time, Dave?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/25/19, 12:06 PM, "David Hastings" 
>> wrote:
>>
>> I use them for query boosting, so if someone searches for:
>>
>> i dont want to rush limbaugh out the door
>> vs
>> i talked to rush limbaugh through the door
>>
>> my documents where 'rush limbaugh' is a known entity (noun) and a
>> person
>> (look at the sentence, its obviously a person and the nlp finds that)
>> have
>> 'rush limbaugh' stored in a field, which is boosted on queries.  this
>> makes
>> sure results from the second query with him as a person will be
>> boosted
>> above those from the first query
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
>> nicolas.pa...@riseup.net>
>> wrote:
>>
>> > Also we are using stanford POS tagger for french. The processing
>> time is
>>     > mitigated by the spark-corenlp package which distribute the process
>> over
>> > multiple node.
>> >
>> > Also I am interesting in the way you use POS information within solr
>> > queries, or solr fields.
>> >
>> > Thanks,
>> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
>> > > ah, yeah its not the fastest but it proved to be the best for my
>> > purposes,
>> > > I use it to pre-process data before indexing, to apply more
>> metadata to
>> > the
>> > > documents in a separate field(s)
>> > >
>> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
>> > > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > > No, I meant for part-of-speech tagging __ But that's
>> interesting that
>> > you
>> > > > use StanfordNLP. I've read that it's very slow, so we are
>> concerned
>> > that it
>> > > > might not work for us at query-time. Do you use it at
>> query-time, or
>> > just
>> > > > index-time?
>> > > >
>> > > > --
>> > > > Audrey Lorberfeld
>> > > > Data Scientist, w3 Search
>> > > > IBM
>> > > > audrey.lorberf...@ibm.com
>> > > >
>> > > >
>> > > > On 10/25/19, 10:30 AM, "David Hastings" <
>> hastings.recurs...@gmail.com
>> > >
>> > > > wrote:
>> > > >
>> > > > Do you mean for entity extraction?
>> > > > I make a LOT of use from the stanford nlp project, and get
>> out the
>> > > > entities
>> > > > and use them for different purposes in solr
>> > > > -Dave
>> > > >
>> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
>> > > > audrey.lorberf...@ibm.com 
>> wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > Does anyone use a POS tagger with their Solr instance
>> other than
>> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > --
>> > > > > Audrey Lorberfeld
>> > > > > Data Scientist, w3 Search
>> > > > > IBM
>> > > > > audrey.lorberf...@ibm.com
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> >
>> > --
>> > nicolas
>> >
>>
>>
>>


Re: Re: POS Tagger

2019-10-25 Thread David Hastings
nope, i boost the fields already tagged at query time against the query

On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> So then you do run your POS tagger at query-time, Dave?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 12:06 PM, "David Hastings" 
> wrote:
>
> I use them for query boosting, so if someone searches for:
>
> i dont want to rush limbaugh out the door
> vs
> i talked to rush limbaugh through the door
>
> my documents where 'rush limbaugh' is a known entity (noun) and a
> person
> (look at the sentence, its obviously a person and the nlp finds that)
> have
> 'rush limbaugh' stored in a field, which is boosted on queries.  this
> makes
> sure results from the second query with him as a person will be boosted
> above those from the first query
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> wrote:
>
> > Also we are using stanford POS tagger for french. The processing
> time is
> > mitigated by the spark-corenlp package which distribute the process
> over
> > multiple node.
> >
> > Also I am interesting in the way you use POS information within solr
> > queries, or solr fields.
> >
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah its not the fastest but it proved to be the best for my
> > purposes,
> > > I use it to pre-process data before indexing, to apply more
> metadata to
> > the
> > > documents in a separate field(s)
> > >
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > No, I meant for part-of-speech tagging __ But that's interesting
> that
> > you
> > > > use StanfordNLP. I've read that it's very slow, so we are
> concerned
> > that it
>     > > > might not work for us at query-time. Do you use it at
> query-time, or
> > just
> > > > index-time?
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > > > On 10/25/19, 10:30 AM, "David Hastings" <
> hastings.recurs...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use from the stanford nlp project, and get
> out the
> > > > entities
> > > > and use them for different purposes in solr
> > > > -Dave
> > > >
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Does anyone use a POS tagger with their Solr instance
> other than
> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > audrey.lorberf...@ibm.com
> > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> > --
> > nicolas
> >
>
>
>


Re: POS Tagger

2019-10-25 Thread David Hastings
I use them for query boosting, so if someone searches for:

i dont want to rush limbaugh out the door
vs
i talked to rush limbaugh through the door

my documents where 'rush limbaugh' is a known entity (noun) and a person
(look at the sentence, its obviously a person and the nlp finds that) have
'rush limbaugh' stored in a field, which is boosted on queries.  this makes
sure results from the second query with him as a person will be boosted
above those from the first query
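a rough SolrJ sketch of that kind of boost, assuming the tagged entities were
indexed into a field called entity_person (the field names and weights here
are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EntityBoostQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery q = new SolrQuery("i talked to rush limbaugh through the door");
            q.set("defType", "edismax");
            // docs whose entity_person field contains "rush limbaugh" get pushed up
            q.set("qf", "text entity_person^5");
            q.set("pf", "text^2 entity_person^10");
            System.out.println(client.query(q).getResults().getNumFound() + " hits");
        }
    }
}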












On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris 
wrote:

> Also we are using stanford POS tagger for french. The processing time is
> mitigated by the spark-corenlp package which distribute the process over
> multiple node.
>
> Also I am interesting in the way you use POS information within solr
> queries, or solr fields.
>
> Thanks,
> On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > ah, yeah its not the fastest but it proved to be the best for my
> purposes,
> > I use it to pre-process data before indexing, to apply more metadata to
> the
> > documents in a separate field(s)
> >
> > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > No, I meant for part-of-speech tagging __ But that's interesting that
> you
> > > use StanfordNLP. I've read that it's very slow, so we are concerned
> that it
> > > might not work for us at query-time. Do you use it at query-time, or
> just
> > > index-time?
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/25/19, 10:30 AM, "David Hastings"  >
> > > wrote:
> > >
> > > Do you mean for entity extraction?
> > > I make a LOT of use from the stanford nlp project, and get out the
> > > entities
> > > and use them for different purposes in solr
> > > -Dave
> > >
> > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > Hi All,
> > > >
> > > > Does anyone use a POS tagger with their Solr instance other than
> > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > >
> > >
> > >
>
> --
> nicolas
>


Re: Re: POS Tagger

2019-10-25 Thread David Hastings
ah, yeah it's not the fastest but it proved to be the best for my purposes,
I use it to pre-process data before indexing, to apply more metadata to the
documents in a separate field(s)
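roughly what that looks like on the indexing side, with the extracted entities
going into their own field (the field names are made up, and the entity list
stands in for whatever the NLP step produced):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import java.util.Arrays;
import java.util.List;

public class IndexWithEntities {
    public static void main(String[] args) throws Exception {
        // pretend these came out of the pre-processing / entity extraction step
        List<String> people = Arrays.asList("rush limbaugh");
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("text", "i talked to rush limbaugh through the door");
            for (String p : people) {
                doc.addField("entity_person", p);  // separate, boostable field
            }
            client.add(doc);
            client.commit();
        }
    }
}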

On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> No, I meant for part-of-speech tagging __ But that's interesting that you
> use StanfordNLP. I've read that it's very slow, so we are concerned that it
> might not work for us at query-time. Do you use it at query-time, or just
> index-time?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 10:30 AM, "David Hastings" 
> wrote:
>
> Do you mean for entity extraction?
> I make a LOT of use from the stanford nlp project, and get out the
> entities
> and use them for different purposes in solr
> -Dave
>
> On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Hi All,
> >
> > Does anyone use a POS tagger with their Solr instance other than
> > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> >
> > Thanks!
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
>
>
>


Re: POS Tagger

2019-10-25 Thread David Hastings
https://nlp.stanford.edu/

On Fri, Oct 25, 2019 at 10:29 AM David Hastings <
hastings.recurs...@gmail.com> wrote:

> Do you mean for entity extraction?
> I make a LOT of use from the stanford nlp project, and get out the
> entities and use them for different purposes in solr
> -Dave
>
> On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> Hi All,
>>
>> Does anyone use a POS tagger with their Solr instance other than
>> OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>>
>> Thanks!
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>


Re: POS Tagger

2019-10-25 Thread David Hastings
Do you mean for entity extraction?
I make a LOT of use from the stanford nlp project, and get out the entities
and use them for different purposes in solr
-Dave

On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> Does anyone use a POS tagger with their Solr instance other than
> OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>
> Thanks!
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>


Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
yup.  youre going to find solr is WAY more efficient than you think when it
comes to complex queries.

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like  'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
>  -->  "ibm" (this list is probably 80% of the documents)
>  -->  "the" (this list is 100%  of the english documents)
>  -- >"way"
> 2) apply edismax parser
>  --> foreach term
>  -->  -->  foreach document  in term
>  -->  -->  -->  score it
>
> So, it seems like it would take a toll on our system but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day)
>
> Are you using edismax?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 3:11 PM, "David Hastings" 
> wrote:
>
> if you have anything close to a decent server you wont notice it all.
> im
> at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second
> non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com
>  wrote:
>
> > Also, in terms of computational cost, it would seem that including
> most
> > terms/not having a stop list would take a toll on the system. For
> instance,
> > right now we have "ibm" as a stop word because it appears everywhere
> in our
> > corpus. If we did not include it in the stop words file, we would
> have to
> > retrieve every single document in our corpus and rank them. That's a
> high
> > computational cost, no?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com"
> <
> > audrey.lorberf...@ibm.com> wrote:
> >
> > Wow, thank you so much, everyone. This is all incredibly helpful
> > insight.
> >
> > So, would it be fair to say that the majority of you all do NOT
> use
> > stop words?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/9/19, 11:14 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> >
> > However, with all that said, stopwords CAN be useful in some
> > situations.  I
> > combine stopwords with the shingle factory to create
> "interesting
> > phrases"
> > (not really) that i use in "my more like this" needs.  for
> example,
> > europe for vacation
> > europe on vacation
> > will create the shingle
> > europe_vacation
> > which i can then use to relate other documents that would be
> much
> > more similar in such regard, rather than just using the
> > "interesting words"
> > europe, vacation
> >
> > with stop words, the shingles would be
> > europe_for
> > for_vacation
> > and
> > europe_on
> > on_vacation
> >
> > just something to keep in mind,  theres a lot of creative
> ways to
> > use
> > stopwords depending on your needs.  i use the above for a
> VERY
> > basic ML
> > teacher and it works way better than using stopwords,
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> > erickerick...@gmail.com>
> > wrote:
> >
> > > The theory behind stopwords is that they are “safe” to
> remove
> > when
> > > calculating relevance, so we can squeeze every last bit of
> > usefulness out
> > >

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
if you have anything close to a decent server you wont notice it all.  im
at about 21 million documents, index varies between 450gb to 800gb
depending on merges, and about 60k searches a day and stay sub second non
stop, and this is on a single core/non cloud environment

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> --
> Audrey Lorberfeld
>     Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the
> "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to
> use
> stopwords depending on your needs.  i use the above for a VERY
> basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove
> when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes
> kilobytes). We’ve
> > come a long way since then and the necessity of removing
> stopwords from the
> > indexed tokens to conserve RAM and disk is much less relevant
> than it used
> > to be in “the bad old days” when the idea of stopwords was
> invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”,
> but I’ll
> > totally agree that you should remove stopwords only _after_ you
> have some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the
> affordability of
> > hardware...can you expand? I'm not sure I understand.
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Another thing to add to the above,
> > >>
> > >> IT:ibm. In this case, we would want to maintain the colon and
> the
> > >> capitalization (otherwise “it” would be taken out as a
> stopword).
> > >>
>   

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
only in my more like this tools, but they have a very specific purpose,
otherwise no

On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use stop
> words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the "interesting
> words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to use
> stopwords depending on your needs.  i use the above for a VERY basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes kilobytes).
> We’ve
> > come a long way since then and the necessity of removing stopwords
> from the
> > indexed tokens to conserve RAM and disk is much less relevant than
> it used
> > to be in “the bad old days” when the idea of stopwords was invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”, but
> I’ll
> > totally agree that you should remove stopwords only _after_ you have
> some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hey Alex,
>     > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the affordability of
> > hardware...can you expand? I'm not sure I understand.
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Another thing to add to the above,
> > >>
> > >> IT:ibm. In this case, we would want to maintain the colon and the
> > >> capitalization (otherwise “it” would be taken out as a stopword).
> > >>
> > >stopwords are a thing of the past at this point.  there is no
> benefit
> > to
> > >using them now with hardware being so cheap.
> > >
> > >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > >wrote:
> > >
> > >> If you don't want it to be touched by a tokenizer, how would the
> > >> protection step know that the sequence of characters you want to
> > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > >> protect"?
> > >>
> > >> What it sounds to me is that you may want to:
> > >> 1) copyField to a second field
> > >> 2) Apply a much lighter (whitespace?) tokenizer to that second
> field
> > >> 3) Run the results through something like KeepWordFilterFactory
> > >> 4) Search both fields with a boost on the second, higher-signal
> field
> > >>
> > >> The other option is to run CharacterFilter,
> &

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
oh and by 'non stop' i mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings 
wrote:

> if you have anything close to a decent server you wont notice it all.  im
> at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> Also, in terms of computational cost, it would seem that including most
>> terms/not having a stop list would take a toll on the system. For instance,
>> right now we have "ibm" as a stop word because it appears everywhere in our
>> corpus. If we did not include it in the stop words file, we would have to
>> retrieve every single document in our corpus and rank them. That's a high
>> computational cost, no?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
>> audrey.lorberf...@ibm.com> wrote:
>>
>> Wow, thank you so much, everyone. This is all incredibly helpful
>> insight.
>>
>> So, would it be fair to say that the majority of you all do NOT use
>> stop words?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 11:14 AM, "David Hastings" 
>> wrote:
>>
>> However, with all that said, stopwords CAN be useful in some
>> situations.  I
>> combine stopwords with the shingle factory to create "interesting
>> phrases"
>> (not really) that i use in "my more like this" needs.  for
>> example,
>> europe for vacation
>> europe on vacation
>> will create the shingle
>> europe_vacation
>> which i can then use to relate other documents that would be much
>> more similar in such regard, rather than just using the
>> "interesting words"
>> europe, vacation
>>
>> with stop words, the shingles would be
>> europe_for
>> for_vacation
>> and
>> europe_on
>> on_vacation
>>
>> just something to keep in mind,  theres a lot of creative ways to
>> use
>> stopwords depending on your needs.  i use the above for a VERY
>> basic ML
>> teacher and it works way better than using stopwords,
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
>> erickerick...@gmail.com>
>> wrote:
>>
>> > The theory behind stopwords is that they are “safe” to remove
>> when
>> > calculating relevance, so we can squeeze every last bit of
>> usefulness out
>> > of very constrained hardware (think 64K of memory. Yes
>> kilobytes). We’ve
>> > come a long way since then and the necessity of removing
>> stopwords from the
>> > indexed tokens to conserve RAM and disk is much less relevant
>> than it used
>> > to be in “the bad old days” when the idea of stopwords was
>> invented.
>> >
>> > I’m not quite so confident as Alex that there is “no benefit”,
>> but I’ll
>> > totally agree that you should remove stopwords only _after_ you
>> have some
>> > evidence that removing them is A Good Thing in your situation.
>> >
>> > And removing stopwords leads to some interesting corner cases.
>> Consider a
>>     > search for “to be or not to be” if they’re all stopwords.
>> >
>> > Best,
>> > Erick
>> >
>> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>> > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > Hey Alex,
>> > >
>> > > Thank you!
>> > >
>> > > Re: stopwords being a thing of the past due to the
>> affordability of
>> > hardware...can you expand? I'm not sure I understand.
>> > >
>> > > --
>> > > Audrey Lorberfeld
>>

Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Yeah, I don't use it as a search, only, well, for finding more documents like
that one :) . For my purposes I tested between 2- and 5-part shingles and
found that the 2-part shingles actually gave me better results, for my use
case, than using any more.

I dont suppose you could point me to any of the phrase IDF documentation
for solr by chance?  That would be fun to poke around with.

On Wed, Oct 9, 2019 at 2:49 PM Walter Underwood 
wrote:

> We did something like that with Infoseek and Ultraseek. We had a set of
> “glue words” that made noun phrases and indexed patterns like “noun glue
> noun”
> as single tokens.
>
> I remember Doug Cutting saying that Nutch did something similar using
> pairs,
> but using that as a prefilter instead of as a relevance term.
>
> This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek
> always
> beat Google in relevance tests, probably because of phrase IDF.
>
> More Like This could do the same thing, but it seems to be really slow and
> not especially useful as a search component.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Oct 9, 2019, at 8:14 AM, David Hastings 
> wrote:
> >
> > However, with all that said, stopwords CAN be useful in some
> situations.  I
> > combine stopwords with the shingle factory to create "interesting
> phrases"
> > (not really) that i use in "my more like this" needs.  for example,
> > europe for vacation
> > europe on vacation
> > will create the shingle
> > europe_vacation
> > which i can then use to relate other documents that would be much
> > more similar in such regard, rather than just using the "interesting
> words"
> > europe, vacation
> >
> > with stop words, the shingles would be
> > europe_for
> > for_vacation
> > and
> > europe_on
> > on_vacation
> >
> > just something to keep in mind,  theres a lot of creative ways to use
> > stopwords depending on your needs.  i use the above for a VERY basic ML
> > teacher and it works way better than using stopwords,
> >
> > On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
> > wrote:
> >
> >> The theory behind stopwords is that they are “safe” to remove when
> >> calculating relevance, so we can squeeze every last bit of usefulness
> out
> >> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> >> come a long way since then and the necessity of removing stopwords from
> the
> >> indexed tokens to conserve RAM and disk is much less relevant than it
> used
> >> to be in “the bad old days” when the idea of stopwords was invented.
> >>
> >> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> >> totally agree that you should remove stopwords only _after_ you have
> some
> >> evidence that removing them is A Good Thing in your situation.
> >>
> >> And removing stopwords leads to some interesting corner cases. Consider
> a
> >> search for “to be or not to be” if they’re all stopwords.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> >> audrey.lorberf...@ibm.com  wrote:
> >>>
> >>> Hey Alex,
> >>>
> >>> Thank you!
> >>>
> >>> Re: stopwords being a thing of the past due to the affordability of
> >> hardware...can you expand? I'm not sure I understand.
> >>>
> >>> --
> >>> Audrey Lorberfeld
> >>> Data Scientist, w3 Search
> >>> IBM
> >>> audrey.lorberf...@ibm.com
> >>>
> >>>
> >>> On 10/8/19, 1:01 PM, "David Hastings" 
> >> wrote:
> >>>
> >>>   Another thing to add to the above,
> >>>>
> >>>> IT:ibm. In this case, we would want to maintain the colon and the
> >>>> capitalization (otherwise “it” would be taken out as a stopword).
> >>>>
> >>>   stopwords are a thing of the past at this point.  there is no benefit
> >> to
> >>>   using them now with hardware being so cheap.
> >>>
> >>>   On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >>>   wrote:
> >>>
> >>>> If you don't want it to be touched by a tokenizer, how would the
> >>>> protection step know that the sequence of characters you want to
> >>>> protect is "IT:ibm" and n

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
However, with all that said, stopwords CAN be useful in some situations.  I
combine stopwords with the shingle factory to create "interesting phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,
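
A rough sketch of what such an analysis chain could look like in a schema
(the type name and stopword file are assumptions, and exactly how shingles
are formed across removed stop words depends on the fillerToken and
position-gap handling, so treat this as a starting point to test, not a
drop-in config):

<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- two-word shingles only, joined with "_" -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false" tokenSeparator="_" fillerToken=""/>
  </analyzer>
</fieldType>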













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness out
> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> come a long way since then and the necessity of removing stopwords from the
> indexed tokens to conserve RAM and disk is much less relevant than it used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> totally agree that you should remove stopwords only _after_ you have some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. Consider a
> search for “to be or not to be” if they’re all stopwords.
>
> Best,
> Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >> 3) Run the results through something like KeepWordFilterFactory
> >> 4) Search both fields with a boost on the second, higher-signal field
> >>
> >> The other option is to run CharacterFilter,
> >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >> term365". As long as it is done on both indexing and query, they will
> >> still match. You may have to have a bunch of them or write some sort
> >> of lookup map.
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >> audrey.lorberf...@ibm.com  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> This is likely a rudimentary question, but I can’t seem to find a
> >> straight-forward answer on forums or the documentation…is there a way to
> >> protect tokens from ANY analysis? I know things like the
> >> KeywordMarkerFilterFactory protect tokens from stemming, but we have
> some
> >> terms we don’t even want our tokenizer to touch. Mostly, these are
> >> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> >> maintain the colon and the capitalization (otherwise “it” would be taken
> >> out as a stopword).
> >>>
> >>> Any advice is appreciated!
> >>>
> >>> Thank you,
> >>> Audrey
> >>>
> >>> --
> >>> Audrey Lorberfeld
> >>> Data Scientist, w3 Search
> >>> IBM
> >>> audrey.lorberf...@ibm.com
> >>>
> >>
> >
> >
>
>


Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
another add on, as the previous two were pretty much spot on:

https://www.google.com/search?q=drive+in

vs

https://www.google.com/search?q=drive+on


On Wed, Oct 9, 2019 at 10:41 AM Alexandre Rafalovitch 
wrote:

> Stopwords (it was discussed on mailing list several times I recall):
> The ideas is that it used to be part of the tricks to make the index
> as small as possible to allow faster search. Stopwords being the most
> common words
> This days, disk space is not an issue most of the time and there have
> been many optimizations to make stopwords less relevant. Plus, like
> you said, sometimes the stopword management actively gets in the way.
> Here is an interesting - if old - article about it too:
>
> https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be
>
> Regards,
>Alex.
>
> On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> > Another thing to add to the above,
> > >
> > > IT:ibm. In this case, we would want to maintain the colon and the
> > > capitalization (otherwise “it” would be taken out as a stopword).
> > >
> > stopwords are a thing of the past at this point.  there is no
> benefit to
> > using them now with hardware being so cheap.
> >
> > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> > > If you don't want it to be touched by a tokenizer, how would the
> > > protection step know that the sequence of characters you want to
> > > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > > protect"?
> > >
> > > What it sounds to me is that you may want to:
> > > 1) copyField to a second field
> > > 2) Apply a much lighter (whitespace?) tokenizer to that second
> field
> > > 3) Run the results through something like KeepWordFilterFactory
> > > 4) Search both fields with a boost on the second, higher-signal
> field
> > >
> > > The other option is to run CharacterFilter,
> > > (PatternReplaceCharFilterFactory) which is pre-tokenizer to map
> known
> > > complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > > term365". As long as it is done on both indexing and query, they
> will
> > > still match. You may have to have a bunch of them or write some
> sort
> > > of lookup map.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This is likely a rudimentary question, but I can’t seem to find a
> > > straight-forward answer on forums or the documentation…is there a
> way to
> > > protect tokens from ANY analysis? I know things like the
> > > KeywordMarkerFilterFactory protect tokens from stemming, but we
> have some
> > > terms we don’t even want our tokenizer to touch. Mostly, these are
> > > IBM-specific acronyms, such as IT:ibm. In this case, we would want
> to
> > > maintain the colon and the capitalization (otherwise “it” would be
> taken
> > > out as a stopword).
> > > >
> > > > Any advice is appreciated!
> > > >
> > > > Thank you,
> > > > Audrey
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > >
> >
> >
>


Re: Protecting Tokens from Any Analysis

2019-10-08 Thread David Hastings
Another thing to add to the above,
>
> IT:ibm. In this case, we would want to maintain the colon and the
> capitalization (otherwise “it” would be taken out as a stopword).
>
stopwords are a thing of the past at this point.  there is no benefit to
using them now with hardware being so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
wrote:

> If you don't want it to be touched by a tokenizer, how would the
> protection step know that the sequence of characters you want to
> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> protect"?
>
> What it sounds to me is that you may want to:
> 1) copyField to a second field
> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> 3) Run the results through something like KeepWordFilterFactory
> 4) Search both fields with a boost on the second, higher-signal field
>
> The other option is to run CharacterFilter,
> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> term365". As long as it is done on both indexing and query, they will
> still match. You may have to have a bunch of them or write some sort
> of lookup map.
>
> Regards,
>Alex.
>
> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hi All,
> >
> > This is likely a rudimentary question, but I can’t seem to find a
> straight-forward answer on forums or the documentation…is there a way to
> protect tokens from ANY analysis? I know things like the
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> terms we don’t even want our tokenizer to touch. Mostly, these are
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> maintain the colon and the capitalization (otherwise “it” would be taken
> out as a stopword).
> >
> > Any advice is appreciated!
> >
> > Thank you,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
>


Re: SolR: How to sort (or boost) by Availability dates

2019-09-24 Thread David Hastings
It sounds like you want to do a normal search but only show available
items.  You could simply add an fq parameter with dynamic values based
on the current date:
fq=available_from:[* TO $todays_date] AND available_to:[$todays_date TO *]
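
A sketch of the same idea written with Solr date math (assuming both fields
are indexed as date types), so no client-side date substitution is needed,
plus an edismax boost-query variant if the goal is to rank currently
available items first rather than hide the rest; the ^10 weight is an
arbitrary assumption to tune:

fq=available_from:[* TO NOW] AND available_to:[NOW TO *]

bq=(+available_from:[* TO NOW] +available_to:[NOW TO *])^10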

On Tue, Sep 24, 2019 at 9:41 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi Federico,
>
> I am not sure exactly what syntax would get you the functionality that
> you're looking for, but I'd recommend writing a boost function. That's what
> we're doing right now for boosting more recent results in our search
> engine. You'd somehow have to work with date math and possibly make a copy
> field to store the answer of the mathematical expression that would
> incorporate the NOW part...
>
> Our boost function is
> "recip(div(ms(NOW,date_discount),262800),1,24,24." It goes in the "bf"
> parameter when using the edismax parser. Our function translates to "max
> boost set to 1 for new docs, down to .4 after 3 years." We came up with the
> time frame of the boost after creating a histogram of our corpus's
> "update_date" field values (copied to the "date_discount" field) and
> finding that monthly binning gave us the most normal distribution (as
> opposed to weekly or yearly).
>
> We came up this solution after lots of surfing Solr forums and reading a
> lot about date math (
> https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/working-with-dates.html#date-math)
> and boost functions (
> https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/the-dismax-query-parser.html#bf-boost-functions-parameter).
>
>
> Currently, we are running a grid search optimized for nDCG that runs
> ~1x/week to give us the optimal a,b constants to sub out for the 24s in the
> function above. We plan to change this to a linear model in the future to
> cut down on the time it takes to run.
>
> Hopefully this gives you a nice starting place!
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/24/19, 5:47 AM, "digi_business" 
> wrote:
>
> i'm facing a big problem in my SolR DB.
> My objects have a datetime field "Available_From" and a datetime field
> "Available_To". We also have a "Ranking" field for the sorting that we
> usually use desc.
> I can search correctly with direct queries (eg. give me all the items
> that
> are available at the moment) but when i do a regular search by other
> criteria i cannot find a way to show the items that result "available
> NOW"
> in the first places in the specific search results, usually sorted by
> "Ranking" field.
> How can i do this? Am I forced to write some java classes (the nearest
> thing
> i've found is there
>
> https://medium.com/@devchaitu18/sorting-based-on-a-custom-function-in-solr-c94ddae99a12
> )
> or is there a way to do with standard SolR queries?
> Will boosting work? If yes, how can i boost by the 2 "Available_From"
> and
> "Available_To" fields verified at the same time, and then show the
> other
> results sorted by "Ranking" desc ?
> Thanks in advance to everyone!
>
>
>
> --
> Sent from:
> https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>
>


Re: Moving to solrcloud from single instance

2019-08-12 Thread David Hastings
I actually never had a problem with the index being larger than the memory
for a standalone instance, but the entire index is on an SSD at least one
my end

On Mon, Aug 12, 2019 at 3:43 PM Erie Data Systems 
wrote:

> I am starting the planning stages of moving from a single instance of solr
> 8 to a solrcloud implementation.
>
> Currently I have a 148GB index on a single dedicated server w 96gb ram @ 16
> cores /2.4ghz ea. + SSD disk. The search is fast but obviously the index
> size is greater than the physical memory, which to my understanding is not
> a good thing.
>
> I have a lot of experience with single instance but none with solrcloud. I
> have 3 machines (other than my main 1) with the exact same hardware 96gb *
> 3 essentially which should be plenty.
>
> My issue is that im not sure where to go to learn how to set this up, how
> many shards, how many replicas, etc and would rather hire somebody or
> something (detailed video or document)  to guide me through the process,
> and make decisions along the way...For example I think a shard is a piece
> of the index... but I dont even know how to decide how many replicas or
> what they are .
>
> Thanks everyone.
> -Craig
>


Re: more like this query parser with faceting

2019-08-12 Thread David Hastings
should be fine,
https://cwiki.apache.org/confluence/display/solr/MoreLikeThisHandler

for more info
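
For example, a request against the MoreLikeThis handler can carry fq filters
alongside the mlt.* parameters; something like the sketch below (this assumes
the handler is registered at /mlt, and the field and filter names are just
placeholders):

/mlt?q=id:12345&mlt.fl=title,description&mlt.mintf=1&mlt.mindf=1&fq=category:books&fq=price:[0 TO 20]&rows=10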

On Mon, Aug 12, 2019 at 2:49 PM Szűcs Roland 
wrote:

> Hi David,
> Thanks the fast reply. Am I right that I can combine fq with mlt only if I
> use more like this as a query parser?
>
> Is there a way to achieve the same with mlt as a request handler?
> Roland
>
> David Hastings  ezt írta (időpont: 2019.
> aug.
> 12., H, 20:44):
>
> > The easiest way will be to pass in a filter query (fq)
> >
> > On Mon, Aug 12, 2019 at 2:40 PM Szűcs Roland <
> szucs.rol...@bookandwalk.hu>
> > wrote:
> >
> > > Hi All,
> > >
> > > Is there any tutorial or example how to use more like this
> functionality
> > > when we have some other constraints set by the user through faceting
> > > parameters like price range, or product category for example?
> > >
> > > Cheers,
> > > Roland
> > >
> >
>


Re: more like this query parser with faceting

2019-08-12 Thread David Hastings
The easiest way will be to pass in a filter query (fq)
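
For example, with the MLT query parser the filter just rides along as a
normal fq (the field names and the document id here are placeholders):

q={!mlt qf=title,description mintf=1 mindf=1}12345&fq=category:books&fq=price:[0 TO 20]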

On Mon, Aug 12, 2019 at 2:40 PM Szűcs Roland 
wrote:

> Hi All,
>
> Is there any tutorial or example how to use more like this functionality
> when we have some other constraints set by the user through faceting
> parameters like price range, or product category for example?
>
> Cheers,
> Roland
>


Re: Ranking

2019-07-27 Thread David Hastings
I can’t imagine this is actually true unless you have a default copy field and
“i” is indexed in one of them. Also, the letter “i” is a bizarre test case.

> On Jul 27, 2019, at 3:40 PM, Steven White  wrote:
> 
> Hi everyone,
> 
> I have 2 files like so:
> 
> FA has the letter "i" only 2 times, and the file size is 54,246 bytes
> FB has the letter "i" 362 times and the file size is 9,953
> 
> When I search on the letter "i" FB is ranked lower which confuses me
> because I was under the impression the occurrences of the term in a
> document and the document size is a factor as such I was expecting FB to
> rank higher.  Did I get this right?  If not, what's causing FB to rank
> lower?
> 
> I'm on Solr 8.1
> 
> Thanks
> 
> Steven


Re: Getting list of unique values in a field

2019-07-12 Thread David Hastings
i found this:

https://stackoverflow.com/questions/14485031/faceting-using-solrj-and-solr4

and this

https://www.programcreek.com/java-api-examples/?api=org.apache.solr.client.solrj.response.FacetField


just from a google search
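
Since you asked for SolrJ specifically, here is a minimal sketch (the Solr
URL and collection name are placeholders) that facets on CC_FILE_EXT and
reads back the distinct values without fetching any documents:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UniqueFileExtensions {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery query = new SolrQuery("*:*");
      query.setRows(0);              // no documents, just facet counts
      query.setFacet(true);
      query.addFacetField("CC_FILE_EXT");
      query.setFacetLimit(-1);       // return all distinct values
      query.setFacetMinCount(1);     // skip values with no documents
      QueryResponse response = client.query(query);
      FacetField extensions = response.getFacetField("CC_FILE_EXT");
      for (FacetField.Count count : extensions.getValues()) {
        System.out.println(count.getName() + " : " + count.getCount());
      }
    }
  }
}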

On Fri, Jul 12, 2019 at 9:46 AM Steven White  wrote:

> Thanks David.  But is there a SolrJ sample code on how to do this?  I need
> to see one, or at least the API, so I know how to make the call.
>
> Steven
>
> On Fri, Jul 12, 2019 at 9:42 AM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > just use a facet on the field should work yes?
> >
> > On Fri, Jul 12, 2019 at 9:39 AM Steven White 
> wrote:
> >
> > > Hi everyone,
> > >
> > > One of my indexed field is as follows:
> > >
> > >  > > multiValued="false" indexed="true" required="true" stored="false"/>
> > >
> > > It holds the file extension of the files I'm indexing.  That is, let us
> > say
> > > I indexed 10 million files and the result of such indexing, the field
> > > CC_FILE_EXT will now have the file extension.  In my case the unique
> file
> > > extension list is about 300.
> > >
> > > Using SolrJ, is there a quick and fast way for me to get back all the
> > > unique values this field has across all of my document?  I don't and
> > cannot
> > > scan all the 10 million indexed documents in Solr to build that list.
> > That
> > > would be very inefficient.
> > >
> > > Thanks,
> > >
> > > Steven
> > >
> >
>


Re: Getting list of unique values in a field

2019-07-12 Thread David Hastings
just use a facet on the field should work yes?

On Fri, Jul 12, 2019 at 9:39 AM Steven White  wrote:

> Hi everyone,
>
> One of my indexed field is as follows:
>
>  multiValued="false" indexed="true" required="true" stored="false"/>
>
> It holds the file extension of the files I'm indexing.  That is, let us say
> I indexed 10 million files and the result of such indexing, the field
> CC_FILE_EXT will now have the file extension.  In my case the unique file
> extension list is about 300.
>
> Using SolrJ, is there a quick and fast way for me to get back all the
> unique values this field has across all of my document?  I don't and cannot
> scan all the 10 million indexed documents in Solr to build that list.  That
> would be very inefficient.
>
> Thanks,
>
> Steven
>


Re: Large Filter Query

2019-06-26 Thread David Hastings
yeah there is a performance hit but that is expected.  in my scenario i
sometimes pass a few thousand ids using this method, but i pre-process my
results since it's a set.  you will not have any uri-length issues if you
are using POST.

On Wed, Jun 26, 2019 at 3:02 PM Lucky Sharma  wrote:

> Thanks, Jagdish
> But what if we need to perform search and filtering on those 1.5k doc
> ids results, also for URI error, we can go with the POST approach,
> and what if the data is not sharded.
>
> Regards,
> Lucky Sharma
>
> On Thu, Jun 27, 2019 at 12:28 AM jai dutt 
> wrote:
> >
> > 1. No Solr is not for id search.  rdms a better option.
> > 2. Yes correct it going to impact query  performance. And you may got
> > large uri error.
> > 3 ya you can pass ids internally by writing any custom parser.or divide
> > data into different shard.
> >
> >
> >
> > On Thu, 27 Jun, 2019, 12:01 AM Lucky Sharma,  wrote:
> >
> > > Hi all,
> > >
> > > What we are doing is, we will be having a set of unique Ids of solr
> > > document at max 1500, we need to run faceting and sorting among them.
> > > there is no direct search involved.
> > > It's a head-on search since we already know the document unique keys
> > > beforehand.
> > >
> > > 1. Is Solr a better use case for such kind of problem?
> > > 2. Since we will be passing 1500 unique document ids, As per my
> > > understanding it will impact query tree as it will grow bigger. Will
> > > there be any other impacts?
> > > 3. Is it wise to use or solve the situation in this way?
> > >
> > >
> > > --
> > > Warm Regards,
> > >
> > > Lucky Sharma
> > >
>
>
>
> --
> Warm Regards,
>
> Lucky Sharma
> Contact No :+91 9821559918
>


Re: Large Filter Query

2019-06-26 Thread David Hastings
you can use the !terms operator and send them separated by a comma:

{!terms f=id}id1,id2,..id1499,id1500

and run facets normally
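
For example, a full request could look like the sketch below (sent as a POST
so the id list does not hit URI length limits; the facet and sort fields are
placeholders for whatever you actually facet and sort on):

q=*:*&fq={!terms f=id}id1,id2,id3,...,id1500&rows=1500&facet=true&facet.field=category&sort=price asc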


On Wed, Jun 26, 2019 at 2:31 PM Lucky Sharma  wrote:

> Hi all,
>
> What we are doing is, we will be having a set of unique Ids of solr
> document at max 1500, we need to run faceting and sorting among them.
> there is no direct search involved.
> It's a head-on search since we already know the document unique keys
> beforehand.
>
> 1. Is Solr a better use case for such kind of problem?
> 2. Since we will be passing 1500 unique document ids, As per my
> understanding it will impact query tree as it will grow bigger. Will
> there be any other impacts?
> 3. Is it wise to use or solve the situation in this way?
>
>
> --
> Warm Regards,
>
> Lucky Sharma
>


Re: Re: Query takes a long time Solr 6.1.0

2019-06-07 Thread David Hastings
There isn't anything wrong aside from your query being poorly thought out.

On Fri, Jun 7, 2019 at 11:04 AM vishal patel 
wrote:

> Any one is looking my issue??
>
> Get Outlook for Android
>
> 
> From: vishal patel
> Sent: Thursday, June 6, 2019 5:15:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query takes a long time Solr 6.1.0
>
> Thanks for your reply.
>
> > How much index data is on one server with 256GB of memory?  What is the
> > max heap size on the Solr instance?  Is there only one Solr instance?
>
> One server(256GB RAM) has two below Solr instance and other application
> also
> 1) shards1 (80GB heap ,790GB Storage, 449GB Indexed data)
> 2) replica of shard2 (80GB heap, 895GB Storage, 337GB Indexed data)
>
> The second server(256GB RAM and 1 TB storage) has two below Solr instance
> and other application also
> 1) shards2 (80GB heap, 790GB Storage, 338GB Indexed data)
> 2) replica of shard1 (80GB heap, 895GB Storage, 448GB Indexed data)
>
> Both server memory and disk usage:
> https://drive.google.com/drive/folders/11GoZy8C0i-qUGH-ranPD8PCoPWCxeS-5
>
> Note: Average 40GB heap used normally in each Solr instance. when replica
> gets down at that time disk IO are high and also GC pause time above 15
> seconds. We can not identify the exact issue of replica recovery OR down
> from logs. due to the GC pause? OR due to disk IO high? OR due to
> time-consuming query? OR due to heavy indexing?
>
> Regards,
> Vishal
> 
> From: Shawn Heisey 
> Sent: Wednesday, June 5, 2019 7:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query takes a long time Solr 6.1.0
>
> On 6/5/2019 7:08 AM, vishal patel wrote:
> > I have attached RAR file but not attached properly. Again attached txt
> file.
> >
> > For 2 shards and 2 replicas, we have 2 servers and each has 256 GB ram
> > and 1 TB storage. One shard and another shard replica in one server.
>
> You got lucky.  Even text files usually don't make it to the list --
> yours did this time.  Use a file sharing website in the future.
>
> That is a massive query.  The primary reason that Lucene defaults to a
> maxBooleanClauses value of 1024, which you are definitely exceeding
> here, is that queries with that many clauses tend to be slow and consume
> massive levels of resources.  It might not be possible to improve the
> query speed very much here if you cannot reduce the size of the query.
>
> Your query doesn't look like it is simple enough to replace with the
> terms query parser, which has better performance than a boolean query
> with thousands of "OR" clauses.
>
> How much index data is on one server with 256GB of memory?  What is the
> max heap size on the Solr instance?  Is there only one Solr instance?
>
> The screenshot mentioned here will most likely relay all the info I am
> looking for.  Be sure the sort is correct:
>
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
>
> You will not be able to successfully attach the screenshot to a message.
>   That will require a file sharing website.
>
> Thanks,
> Shawn
>


Re: strange behavior

2019-06-06 Thread David Hastings
audit_author.name:Burley,%20S.K.

translates to
audit_author.name:Burley, DEFAULT_OPERATOR DEFAULT_FIELD:S.K.
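
Because the space after the comma ends the field qualifier, only "Burley," is
searched against audit_author.name, while "S.K." falls back to the default/qf
fields. One way to keep the whole name on the intended field (assuming the
stored value really contains the comma) is to quote it as a phrase:

q=audit_author.name:"Burley, S.K." AND entity.type:polymer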




On Thu, Jun 6, 2019 at 2:46 PM Wendy2  wrote:

>
> Hi,
>
> Why "AND" didn't work anymore?
>
> I use Solr 7.3.1 and edismax parser.
> Could someone explain to me why the following query doesn't work any
> more?
> What could be the cause? Thanks!
>
> q=audit_author.name:Burley,%20S.K.%20AND%20entity.type:polymer
>
> It worked previously but now returned very lower number of documents.
> I had to use "fq" to make it work correctly:
>
> q=audit_author.name:Burley,%20S.K.=entity.type:polymer=1
>
>
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Empty rows from /export?

2019-05-31 Thread David Hastings
> Ah. So docValues are managed by Solr outside of Lucene. Interesting.

i was under the impression docValues are in lucene, and he is just saying
that an optimize is not a re-index: it just takes the segment files that
already exist in your index, merges them, and removes deletions. an
optimize doesn't re-read the schema and re-index content

On Fri, May 31, 2019 at 1:59 PM Walter Underwood 
wrote:

> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
>
> That actually answers a question I had not asked yet. I was curious if it
> was safe to change the id field to docValues without reindexing if we never
> sorted on it. It looks like fetching the value won’t work until everything
> is reindexed.
>
> It seems like this would be a useful thing to have supported, migrating a
> field to docValues.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 31, 2019, at 5:00 AM, Erick Erickson 
> wrote:
> >
> > bq. but I optimized all the cores, which should rewrite every segment as
> docValues.
> >
> > Not true. Optimize is a Lucene level force merge. Dealing with segments,
> i.e. merging and the like, is a low-level Lucene operation and Lucene has
> no notion of a schema. So a change you made to the schema is irrelevant to
> merging.
> >
> > You have to have something at the Solr level that does some magic for
> this to work. Take a look at UninvertDocValuesMergePolicyFactory if you
> have Solr 7.0 or later. WARNING: I haven’t used that personally, and I do
> not know what the behavior would be on an index that is “mixed”, i.e. one
> that already has segments with some docs having DV entries and some not.
> >
> > Best,
> > Erick
> >
> >> On May 31, 2019, at 12:35 AM, Walter Underwood 
> wrote:
> >>
> >> That field was changed to docValues, but I optimized all the cores,
> which should rewrite every segment as docValues.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On May 30, 2019, at 7:37 PM, Erick Erickson 
> wrote:
> >>>
> >>> This is odd. The only reason I know of that would happen is if there
> were no docValues for that field in those documents. By any chance were
> docValues added to an existing index without totally reindexing into a new
> collection?
> >>>
> >>> What happens if you just query the collection rather than the
> individual core? I’m thinking using a streaming expression as a check…..
> >>>
>  On May 30, 2019, at 6:41 PM, Walter Underwood 
> wrote:
> 
>  3/4 of the documents I’m getting back from /export are empty. This
> collection has four shards, so I’m querying the leader core on each shard
> with /export. The results start like this:
> 
> 
> {"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},
> 
>  The final 1/4 of the results have UUIDs (the ID type). The id field
> is stored as docValues. This is the URL.
> 
> 
> http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*=false=shard1=id=id+asc
> 
>  Running 6.6.2, Solr Cloud. The total number of non-null ids from all
> four shards is a bit less than 1/4 of the document count.
> 
>  Any ideas about what is going on?
> 
>  wunder
>  Walter Underwood
>  wun...@wunderwood.org
>  http://observer.wunderwood.org/  (my blog)
> 
> >>>
> >>
> >
>
>


Re: Streaming Expression: get the value of the array at the specified position

2019-05-10 Thread David Hastings
no.

On Fri, May 10, 2019 at 11:09 AM Nazerke S  wrote:

> Hi,
>
> I am interested in getting the value of the array at the given index. For
> example,
>
> let(echo="b", a=array(1,2,3,4,5), b=getAt(a, 2))  should return 3.
>
> Is there a way to get access an array's element by indexing?
>
> Thanks!
>
> __Nazerke
>


Re: Solr query takes a too much time in Solr 6.1.0

2019-05-10 Thread David Hastings
first inclination is your index is cold.

On Fri, May 10, 2019 at 9:32 AM vishal patel 
wrote:

> We have 2 shards and 2 replicas in Live environment. we have multiple
> collections.
> Some times some query takes much time(QTime=52552).  There are so many
> documents indexing and searching within milliseconds.
> When we executed the same query again using admin panel, it does not take
> a much time and it completes within 20 milliseconds.
>
> My Solr Logs :
> 2019-05-10 09:48:56.744 INFO  (qtp1239731077-128223) [c:actionscomments
> s:shard1 r:core_node1 x:actionscomments] o.a.s.c.S.Request
> [actionscomments]  webapp=/solr path=/select
> params={q=%2Bproject_id:(2102117)%2Brecipient_id:(4642365)+%2Bentity_type:(1)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2B(is_formtype_active:true)+%2B(appType:1)=
> s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1
> }
> hits=198 status=0 QTime=52552
> 2019-05-10 09:48:56.744 INFO  (qtp1239731077-127998) [c:actionscomments
> s:shard1 r:core_node1 x:actionscomments] o.a.s.c.S.Request
> [actionscomments]  webapp=/solr path=/select
> params={q=%2Bproject_id:(2102117)%2Brecipient_id:(4642365)+%2Bentity_type:(1)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2Bdue_date:[2019-05-09T19:30:00Z+TO+2019-05-09T19:30:00Z%2B1DAY]+%2B(is_formtype_active:true)+%2B(appType:1)=
> s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1
> }
> hits=0 status=0 QTime=51970
> 2019-05-10 09:48:56.746 INFO  (qtp1239731077-128224) [c:actionscomments
> s:shard1 r:core_node1 x:actionscomments] o.a.s.c.S.Request
> [actionscomments]  webapp=/solr path=/select
> params={q=%2Bproject_id:(2121600+2115171+2104206)%2Brecipient_id:(2834330)+%2Bentity_type:(2)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2Bdue_date:[2019-05-10T00:00:00Z+TO+2019-05-10T00:00:00Z%2B1DAY]=
> s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1
> }
> hits=98 status=0 QTime=51402
>
>
> My schema fields below :
>
>  required="true" multiValued="false"/>
> 
> 
> 
> 
> 
> 
>  stored="true" />
> 
> 
>
> What could be a problem here? why the query takes too much time at that
> time?
>
> Sent from Outlook
>


Re: Search using filter query on multivalued fields

2019-05-03 Thread David Hastings
another option is to index dynamically; in this case, this is what i would do:
INGREDIENT_SALT_i:40
INGREDIENT_EGG_i:20
etc

and query
INGREDIENT_SALT_i:[20 TO *]
or an arbitrary max value, since these are percentages

INGREDIENT_SALT_i:[20 TO 100]
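
A flattened document under this scheme might look like the sketch below
(values taken from the example record earlier in the thread; the *_i dynamic
field from the default schema is assumed):

{ "id": "item-1", "INGREDIENT_EGG_i": 20, "INGREDIENT_CANOLA_OIL_i": 60, "INGREDIENT_SALT_i": 40 }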


On Fri, May 3, 2019 at 12:01 PM Erick Erickson 
wrote:

> There is no way to do this with the setup you describe. That is, there’s
> no way to say “only use the third element of a multiValued field”.
>
> What I’d do is index (perhaps in a separate field) with payloads, so you
> have input like SALT|20, then use some of the payload functionality to make
> this happen. See: https://lucidworks.com/2017/09/14/solr-payloads/
>
> There are some other strategies that are simpler, one could index (again,
> perhaps in a separate field) SALT_20. Then you can form filter queries like
> “fq=ingredient:[SALT_20 TO *]. That’s not very flexible and you have to
> normalize (i.e. 1% couldn’t be SALT_1), so “it depends”.
>
> The point is that you have to index cleverly to do what you want.
>
> Best,
> Erick
>
> > On May 3, 2019, at 6:26 AM, Srinivas Kashyap 
> wrote:
> >
> > Hi,
> >
> > I have indexed data as shown below using DIH:
> >
> > "INGREDIENT_NAME": [
> >  "EGG",
> >  "CANOLA OIL",
> >  "SALT"
> >],
> > "INGREDIENT_NO": [
> >  "550",
> >  "297",
> >  "314"
> >],
> > "COMPOSITION PERCENTAGE": [
> >  20,
> >  60,
> >  40
> >],
> >
> > Similar to this, many other records are also indexed. These are
> multi-valued fields.
> >
> > I have a requirement to search all the records which has ingredient name
> salt and it's composition percentage is more than 20.
> >
> > How do I write a filter query for this?
> >
> > P.S: I should only fetch records, whose Salt Composition percentage is
> more than 20 and not other percentages.
> >
> > Thanks and Regards,
> > Srinivas Kashyap
> > 
> > DISCLAIMER:
> > E-mails and attachments from Bamboo Rose, LLC are confidential.
> > If you are not the intended recipient, please notify the sender
> immediately by replying to the e-mail, and then delete it without making
> copies or using it in any way.
> > No representation is made that this email or any attachments are free of
> viruses. Virus scanning is recommended and is the responsibility of the
> recipient.
>
>


Re: Compound Primary Keys

2019-04-24 Thread David Hastings
another thing to consider is just merging the two fields into the id
value:
"id": "USER_RECORD_12334",
since its a string.



On Wed, Apr 24, 2019 at 2:35 PM Gus Heck  wrote:

> Hi Vivek
>
> Solr is not a database, nor should one try to use it as such. You'll need
> to adjust your thinking some in order to make good use of Solr. In Solr
> there is normally an id field and it should be unique across EVERY document
> in the entire collection. Thus there's no concept of a primary key, because
> there are no tables. In some situations (streaming expressions for example)
> you might want to use collections like tables, creating a collection per
> data type, but there's no way to define uniqueness in terms of more than
> one field within a collection. If your data comes from a database with
> complex keys, concatenating the values to form the single unique ID is a
> possibility. If you form keys that way of course you also want to retain
> the values as individual fields. This duplication might seem odd from a
> database perspective where one often works hard to normalize data, but for
> search, denormalization is very common. The focus with search engines is
> usually speed of retrieval rather than data correctness. Solr should serve
> as an index into some other canonical source of truth for your data, and
> that source of truth should be in charge of guaranteeing data correctness.
>
> Another alternative is to provide a field that denotes the type (table) for
> the document (such as id_type in your example). In that case, all queries
> looking for a specific object type as a result should add a filter (fq
> parameter) to denote the "table" and you may want to store a db_id field to
> correlate the data with a database if that's where it came from. When using
> the field/filter strategy you tend to inflate the number of fields in the
> index with some fields being sparsely populated and this can have some
> performance implications, and furthermore if one "table" gets updated
> frequently you wind up interfering with the caching for all data due to
> frequent opening of new searchers. On the plus side such a strategy makes
> it easier to query across multiple types simultaneously, so these
> considerations should be balanced against your usage patterns, performance
> needs, ease of management and ease of programming.
>
> Best,
> Gus
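
To illustrate the field/filter strategy described above, a query against one
"table" might look like the following (parameter values are only examples):

  q=field_1:some_value
  fq=id_type:USER_RECORD
  fl=id,db_id,id_type,field_1,field_2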
>
> On Fri, Apr 19, 2019 at 2:10 PM Vivekanand Sahay
>  wrote:
>
> > Hello,
> >
> > I have a use case like below.
> >
> > USE CASE
> > I have a document with fields like
> >
> > Id,
> > Id_type,
> > Field_1,
> > Field_2
> >
> > Two sample messages will look like this:
> >
> > {
> >   "id": "12334",
> >   "id_type": "USER_RECORD",
> >   "field_1": null,
> >   "field_2": null
> > }
> >
> >
> > {
> >   "id": "31321",
> >   "id_type": "OWNER_RECORD",
> >   "field_1": null,
> >   "field_2": null
> > }
> >
> >
> > QUESTIONS
> >
> > I’d like to define the unique key as a compound key from fields id and
> > id_type
> >
> >   1.  Could someone give me an example of how to do this ? Or point to
> the
> > relevant section in the docs?
> >   2.  Is this the best way to define a compound primary key ? Is there a
> > more efficient way ?
> >
> > Regards,
> > Vivek
> >
>
>
> --
> http://www.the111shift.com
>


Re: Which fieldType to use for JSON Array in Solr 6.5.0?

2019-04-09 Thread David Hastings
Exactly, Solr is a search index, not a data store. You need to flatten
your relationships. Right tool for the job, etc.
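
As a sketch of that flattening, the nested additionalInfo entry quoted below could
be mapped to plain fields at index time (the manufacturer_s field name is made up
and assumes a matching *_s dynamic field):

  {
    "id": "5ca38e407b154dac08913a96",
    "sampleAttr": "sampleAttrVal",
    "manufacturer_s": "Videocon"
  }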

On Tue, Apr 9, 2019 at 4:28 PM Shawn Heisey  wrote:

> On 4/9/2019 2:04 PM, Abhijit Pawar wrote:
> > Hello Guys,
> >
> > I am trying to index a JSON array from one of my collections in mongoDB into
> > Solr 6.5.0; however, it is not getting indexed.
> >
> > I am using a DataImportHandler for this.
> >
> > Here's how the data looks in mongoDB:
> > {
> >   "idStr" : "5ca38e407b154dac08913a96",
> >   "sampleAttr" : "sampleAttrVal",
> >   "additionalInfo" : [
> >     {
> >       "name" : "Manufacturer",
> >       "value" : "Videocon"
> >     }
> >   ]
> > }
>
> That is not a structure that Solr knows how to handle.  Essentially what
> you have there is one document nested inside another.  Each of Solr's
> documents has a completely flat structure -- there is no possibility of
> a hierarchy within a single document.
>
> Solr does have support for parent/child documents, but it wouldn't be
> indexed like that.  I know almost nothing about how the parent/child
> document support works.  You would have to get help from someone else or
> consult the documentation.
>
> Thanks,
> Shawn
>
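
For completeness, a rough sketch of what nested indexing and a block-join query
can look like via the JSON update handler (not DIH); the doc_type marker and the
*_s field names are assumptions, and details vary by version, so check the
Reference Guide for your release before relying on this:

  Indexing a parent with an anonymous child document:
  [{
    "id": "5ca38e407b154dac08913a96",
    "doc_type": "product",
    "sampleAttr": "sampleAttrVal",
    "_childDocuments_": [
      { "id": "5ca38e407b154dac08913a96_info_1",
        "doc_type": "info",
        "name_s": "Manufacturer",
        "value_s": "Videocon" }
    ]
  }]

  Block-join query returning parents whose children match:
  q={!parent which='doc_type:product'}(+name_s:Manufacturer +value_s:Videocon)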


Re: Boolean Searches?

2019-03-14 Thread David Hastings
oh, thought it was implied with this:
" and also use the edismax query parser"



On Thu, Mar 14, 2019 at 11:38 AM Andy C  wrote:

> Dave,
>
> You don't mention what query parser you are using, but with the default
> query parser you can field qualify all the terms entered in a text box by
> surrounding them with parenthesis. So if you want to search against the
> 'title' field and they entered:
>
> train OR dragon
>
> You could generate the Solr query:
>
> title:(train OR dragon)
>
> Historically, however, Solr has not processed queries that contain a mixture
> of boolean operators as expected. The problem is described here:
> http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/
>
> There is an open JIRA for this (
> https://issues.apache.org/jira/browse/SOLR-4023) so I assume the problem
> still exists in the most recent releases.
>
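> 
A small sketch of that, with the user's raw input dropped inside the parentheses
(the field name is just an example):

  user input:             train OR dragon
  generated Solr query:   q=title:(train OR dragon)
  which is equivalent to: q=title:train OR title:dragon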
> On Thu, Mar 14, 2019 at 10:50 AM Dave Beckstrom 
> wrote:
>
> > Hi Everyone,
> >
> > I'm building a SOLR search application and the customer wants the search
> to
> > work like google search.
> >
> >
> > They want the user to be able to enter boolean searches like:
> >
> > train OR dragon.
> >
> > which would find any matches that have the word "train" or the word
> "dragon"
> > in the title.
> >
> > I know that the SOLR search would look like this:
> >
> > title:train OR title:dragon
> >
> > I am trying to avoid having to parse through what the user enters and
> build
> > out complex search strings.
> >
> > Is there any way that I can build a search against the "title" field
> where
> > if the user enters something like:
> >
> > train OR dragon AND 2
> >
> > it will honor the boolean AND/OR logic without my having to convert it
> into
> > something nasty like:
> >
> > title:train OR title:dragon AND title:2
> >
> >
> > Thank you!
> >
> > --
> > *Fig Leaf Software, Inc.*
> > https://www.figleaf.com/
> > 
> >
> > Full-Service Solutions Integrator
> >
>


Re: Boolean Searches?

2019-03-14 Thread David Hastings
If you make your default operator "OR" (or set q.op=OR) and also use the
edismax query parser, you can use the qf parameter to boost the title heavily
compared to the default field you are using. For example, I use something
like this, which may be overkill:
title^100 description^50 topic^30 text
I also have the same in my pf value as well, but it works for me.
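
Roughly, that setup corresponds to request parameters like these (field names and
boosts are only an example):

  defType=edismax
  q=train OR dragon
  q.op=OR
  qf=title^100 description^50 topic^30 text
  pf=title^100 description^50 topic^30 text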

On Thu, Mar 14, 2019 at 10:50 AM Dave Beckstrom 
wrote:

> Hi Everyone,
>
> I'm building a SOLR search application and the customer wants the search to
> work like google search.
>
>
> They want the user to be able to enter boolean searches like:
>
> train OR dragon.
>
> which would find any matches that have the word "train" or the word "dragon"
> in the title.
>
> I know that the SOLR search would look like this:
>
> title:train OR title:dragon
>
> I am trying to avoid having to parse through what the user enters and build
> out complex search strings.
>
> Is there any way that I can build a search against the "title" field where
> if the user enters something like:
>
> train OR dragon AND 2
>
> it will honor the boolean AND/OR logic without my having to convert it into
> something nasty like:
>
> title:train OR title:dragon AND title:2
>
>
> Thank you!
>
> --
> *Fig Leaf Software, Inc.*
> https://www.figleaf.com/
> 
>
> Full-Service Solutions Integrator
>

