Re: Schema API specifying different analysers for query and index

2021-03-02 Thread Alexandre Rafalovitch
The RefGuide gives this for adding a field type; I would hope the replace command is similar:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "myNewTextField",
    "class": "solr.TextField",
    "indexAnalyzer": {
      "tokenizer": {
        "class": "solr.PathHierarchyTokenizerFactory",
        "delimiter": "/" }},
    "queryAnalyzer": {
      "tokenizer": {
        "class": "solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema

So, indexAnalyzer/queryAnalyzer, rather than array:
https://lucene.apache.org/solr/guide/8_8/schema-api.html#add-a-new-field-type
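
Adapted to the string_ci case from the question below, the replace would
presumably look like this (a hedged sketch, untested; it reuses the query-side
analysis from the original message and adds a keyword-tokenized index side
purely as an illustration):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field-type": {
    "name": "string_ci",
    "class": "solr.TextField",
    "sortMissingLast": true,
    "omitNorms": true,
    "stored": true,
    "docValues": false,
    "indexAnalyzer": {
      "tokenizer": { "class": "solr.KeywordTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]},
    "queryAnalyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]}
  }
}' http://localhost:8983/solr/gettingstarted/schema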

Hope this works,
Alex.
P.s. Also check whether you are using the matching API and V1/V2 endpoint.

On Tue, 2 Mar 2021 at 15:25, ufuk yılmaz  wrote:
>
> Hello,
>
> I’m trying to change a field’s query analyser. The following works, but it 
> replaces both the index and query analysers:
>
> {
>   "replace-field-type": {
>     "name": "string_ci",
>     "class": "solr.TextField",
>     "sortMissingLast": true,
>     "omitNorms": true,
>     "stored": true,
>     "docValues": false,
>     "analyzer": {
>       "type": "query",
>       "tokenizer": {
>         "class": "solr.StandardTokenizerFactory"
>       },
>       "filters": [
>         {
>           "class": "solr.LowerCaseFilterFactory"
>         }
>       ]
>     }
>   }
> }
>
> I tried changing the analyzer field to analyzers, to specify different analysers 
> for query and index, but it gave an error:
>
> {
>   "replace-field-type": {
>     "name": "string_ci",
>     "class": "solr.TextField",
>     "sortMissingLast": true,
>     "omitNorms": true,
>     "stored": true,
>     "docValues": false,
>     "analyzers": [{
>       "type": "query",
>       "tokenizer": {
>         "class": "solr.StandardTokenizerFactory"
>       },
>       "filters": [
>         {
>           "class": "solr.LowerCaseFilterFactory"
>         }
>       ]
>     },{
>       "type": "index",
>       "tokenizer": {
>         "class": "solr.KeywordTokenizerFactory"
>       },
>       "filters": [
>         {
>           "class": "solr.LowerCaseFilterFactory"
>         }
>       ]
>     }]
>   }
> }
>
> "errorMessages":["Plugin init failure for [schema.xml]
> "msg":"error processing commands",...
>
> How can I specify different analyzers for query and index time when using 
> the Schema API?
>
> Sent from Mail for Windows 10
>


Re: Multiword synonyms and term wildcards/substring matching

2021-03-02 Thread Alexandre Rafalovitch
I admit to not fully understanding the examples, but the ComplexPhraseQueryParser
looks like something worth at least reviewing:

https://lucene.apache.org/solr/guide/8_8/other-parsers.html#complex-phrase-query-parser
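
For example, the complexphrase parser supports wildcards inside a phrase,
something like this (a hedged sketch; the field name is hypothetical, and
whether it plays well with multi-word synonyms would need testing):

  q={!complexphrase inOrder=true}title:"bread* stick*"

That gives prefix matching on each term while keeping the terms positionally
related.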

Also, I did not see any references to trying copyField to process the same
content in different ways. If the copyField target is not stored, the overhead
is not as large.

Regards,
Alex



On Tue., Mar. 2, 2021, 7:08 a.m. Martin Graney, 
wrote:

> Hi All
>
> I have been trying to implement multi word synonyms using `sow=false` into
> a pre-existing system that applied pre-processing to the phrase to apply
> wildcards around the terms, i.e. `bread stick` => `*bread* *stick*`.
>
> I got the synonyms expansion working perfectly, after discovering the
> `preserveOriginal` filter param, but then I needed to re-implement the
> existing wildcard behaviour.
> I tried using the edge-ngram filter, but found that when searching for the
> phrase `bread stick` on a field containing the word `breadstick` and
> `q.op=AND` it returns no results, as the content `breadstick` does not
> _start with_ `stick`. The previous wildcard behaviour would return all
> documents that contain the substrings `bread` AND `stick`, which is the
> desired behaviour.
> I tried using the ngram filter, but this does not support the
> `preserveOriginal`, and so loses a lot of relevance for exact matches, but
> it also results in matches that are far too broad, creating 21 tokens from
> `breadstick` for `minGramSize=3` and `maxGramSize=5` that in practice
> essentially matches all of the documents. Which means that boosts applied
> to other fields, such as 'in stock', push irrelevant documents to the top.
>
> Finally, I tried to strip out ngrams entirely and use nested-query
> syntax with local params, a Solr feature that is not very well documented.
> I created something like `q={!edismax sow=true v=$wildcards} OR {!edismax
> sow=false v=$plain}` to effectively create a union of results, one with
> multi word synonyms support and one with wildcard support.
> But then I had to implement the other edismax params and immediately
> stumbled.
> Each query in production normally has a slew of `bf` and `bq` params, and I
> cannot see a way to pass these into the nested query using local variables.
> If I have 3 different `bf` params how can I pass them into the local param
> subqueries?
>
> Also, as the search in production is across multiple fields I found passing
> `qf` to both subqueries using dereferencing failed, as the parser saw it as
> a single field and threw a 'number format exception'.
> i.e.
> q={!edismax sow=true v=$tw qf=$tqf} OR {!edismax sow=false v=$tp qf=$tqf}
> $tw=*bread* *stick*
> $tp=bread stick
> $tqf=title^2 description^0.5
>
> As you can guess, I have spent quite some time going down this rabbit hole
> in my attempt to reproduce the existing desired functionality alongside
> multiterm synonyms.
> Is there a way to get multiterm synonyms working with substring matching
> effectively?
> I am sure there is a much simpler way that I am missing than all of my
> attempts so far.
>
> Solr: 8.3
>
> Thanks
> Martin Graney
>
> --
>  
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Alexandre Rafalovitch
The most likely issue is that your core configuration (solrconfig.xml)
does not have the request handler for that. The same config may have
had it in 7.x, but it has changed since.

More details: 
https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html
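
As a sketch (adapted from the sample_techproducts configset; the lib paths in
particular depend on where your config lives relative to the Solr install, so
adjust as needed), the pieces to add to solrconfig.xml look roughly like:

  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Reload or restart the core afterwards, then retry the post.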

Regards,
   Alex.

On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
>
> I am trying out indexing the exampledocs in the examples folder with the
> SimplePostTool on windows 10 using solr 8.8.  All the documents index
> except sample.html. For that file I get the errors below.  I then
> downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> including sample.html.
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\sample.html
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file sample.html (text/html) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> SimplePostTool: WARNING: Response:
> Error 404 Not Found
> HTTP ERROR 404 Not Found
> URI: /solr/gettingstarted/update/extract
> STATUS: 404
> MESSAGE: Not Found
> SERVLET: default
> SimplePostTool: WARNING: IOException while reading response:
> java.io.FileNotFoundException:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> Time spent: 0:00:00.086
> ```
>
> However the json and all other file types index with no problem. For
> example:
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\books.json
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file books.json (application/json) to [base]/json/docs
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> ```
> Just following this tutorial:[
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
> ]
>
>   [1]:
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support


Re: Solr 8.0 query length limit

2021-02-18 Thread Alexandre Rafalovitch
Also, investigate whether you have repeating conditions, and push those into
defaults in custom request handler endpoints (in solrconfig.xml).

Solr also supports parameter substitution, if you have repeated
subconditions.
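
A rough sketch of the defaults idea in solrconfig.xml (handler name and
parameter values are hypothetical):

  <requestHandler name="/products" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">title^2 description</str>
      <str name="fq">category:electronics</str>
    </lst>
  </requestHandler>

Clients then send only the parts of the query that actually vary, which keeps
the URL well under any header size limit.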

Regards,
 Alex

On Thu., Feb. 18, 2021, 7:08 a.m. Thomas Corthals, 
wrote:

> You can send big queries as a POST request instead of a GET request.
>
> Op do 18 feb. 2021 om 11:38 schreef Anuj Bhargava :
>
> > Solr 8.0 query length limit
> >
> > We are having an issue where queries are too big, we get no result. And
> if
> > we remove a few keywords we get the result.
> >
> > Error we get - error 414 (Request-URI Too Long)
> >
> >
> > Have made the following changes in jetty.xml, still the same error
> >
> > * > name="solr.jetty.output.buffer.size" default="32768" />*
> > * > name="solr.jetty.output.aggregation.size" default="32768" />*
> > * > name="solr.jetty.request.header.size" default="65536" />*
> > * > name="solr.jetty.response.header.size" default="32768" />*
> > * > name="solr.jetty.send.server.version" default="false" />*
> > * > name="solr.jetty.send.date.header" default="false" />*
> > * > name="solr.jetty.header.cache.size" default="1024" />*
> > * > name="solr.jetty.delayDispatchUntilContent" default="false"/>*
> >
>


Re: How to get case-sensitive Terms?

2021-02-18 Thread Alexandre Rafalovitch
The Terms component does not run the analysis chain; it expects already-tokenized
values, because it matches what is returned by faceting.

So I would check whether that field is string or text, and the difference in
processing. Enabling debug will also show the difference in the final expanded
form.
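
On the filtering question: the Terms component reads raw index terms and
ignores fq entirely. One workaround is faceting, which does respect filters;
a rough sketch (field names borrowed from your example, adjust to your schema):

  q=*:*&rows=0&fq=image_id:1&facet=true&facet.field=my_text_field&facet.limit=100

Facet counts are computed only over the documents matching q/fq, which
effectively gives you terms restricted to a subset of documents.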

Regards,
Alex
P. S. It is better to start new question threads for new questions. More
people will pay attention.

On Thu., Feb. 18, 2021, 1:31 a.m. elivis,  wrote:

> Alexandre Rafalovitch wrote
> > What about copyField with the target being index only (docValue only?)
> and
> > no lowercase on the target field type?
> >
> > Solr is not a database, you are optimising for search. So duplicate,
> > multi-process, denormalise, create custom field types, etc.
> >
> > Regards,
> >Alex
>
> Thank you!
>
> One more question - when we index data, we have some other fields that we
> are populating. Our data comes from different inputs, so one of those
> fields
> is a data source ID that the text came from. When we do a search, we are able
> to get search results specific to only that data source by adding a filter
> query (e.g. fq=image_id:1). However, that doesn't seem to work when doing a
> terms query - I always get the terms from the entire index. Is there a way
> to filter the terms?
>
> Thank you again.
>
>
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Meaning of "Index" flag under properties and schema

2021-02-17 Thread Alexandre Rafalovitch
I wonder if looking more directly at the indexes would allow you to
get closer to the problem source.

Have you tried comparing/exploring the indexes with Luke? It is in the
Lucene distribution (not Solr), and there is a small explanation here:
https://mocobeta.medium.com/luke-become-an-apache-lucene-module-as-of-lucene-8-1-7d139c998b2

Regards,
   Alex.

On Wed, 17 Feb 2021 at 16:58, Vivaldi  wrote:
>
> I was getting “illegal argument exception length must be >= 1” when I used 
> significantTerms streaming expression, from this collection and field. I 
> asked about that as a separate question on this list. I will get the whole 
> exception stack trace the next time I am at the customer site.
>
> Why any other field in other collections doesn’t have that flag? We have 
> numerous indexed, non-indexed, docvalues fields in other collections but not 
> that row
>
> Sent from my iPhone
>
> > On 16 Feb 2021, at 20:42, Shawn Heisey  wrote:
> >
> >> On 2/16/2021 9:16 AM, ufuk yılmaz wrote:
> >> I didn’t realise that, sorry. The table is like:
> >> Flags        Indexed   Tokenized   Stored   UnInvertible
> >> Properties   Yes       Yes         Yes      Yes
> >> Schema       Yes       Yes         Yes      Yes
> >> Index        Yes       Yes         Yes      NO
> >> The problematic collection has an Index row under the Schema row. No other 
> >> collection has it. I was asking what the “Index” row meant
> >
> > I am not completely sure, but I think that row means the field was found in 
> > the actual Lucene index.
> >
> > In the original message you mentioned "weird exceptions" but didn't include 
> > any information about them.  Can you give us those exceptions, and the 
> > requests that caused them?
> >
> > Thanks,
> > Shawn
>


Re: Why Solr questions on stackoverflow get very few views and answers, if at all?

2021-02-12 Thread Alexandre Rafalovitch
I answered quite a bunch a while ago, as part of the book-writing process.

I think a lot of them were missing core information, like the version of Solr,
so they were not very timeless.

The list allows a conversation and multiple perspectives, which is better
than a one shot answer.

Regards,
   Alex

On Fri., Feb. 12, 2021, 5:37 a.m. ufuk yılmaz, 
wrote:

> Is it because the main place for questions is this mailing list, or somewhere
> else that I don’t know?
>
> Or Solr isn’t ‘hot’ as some other topics?
>
> Sent from Mail for Windows 10
>
>


Re: Extract a list of the most recent field values?

2021-02-05 Thread Alexandre Rafalovitch
Rewriting:
*) Per
https://lucene.apache.org/solr/guide/8_8/json-request-api.html#json-parameter-merging
there is a way to represent most (all?) of the structure with json.x
parameters (see the sketch below).
*) Also, you can create custom request handlers in solrconfig.xml with
a lot of those parameters either as defaults (overridable or not) or
as paramsets, and refer to them by name with useParams in the file or in the
query ( 
https://lucene.apache.org/solr/guide/8_8/request-parameters-api.html#the-useparams-parameter)
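
On the first point, the facet from the message below can be flattened into
plain request parameters; a hedged sketch (line breaks added for readability,
and the parameter values must be URL-encoded when actually sent):

  q=*:*&rows=0
  &fq=+category:* +modified:[NOW/DAY-60DAYS TO *]
  &json.facet={ranges:{type:range, field:modified,
      start:"NOW/DAY-60DAYS", end:"NOW/DAY", gap:"+7DAY", sort:index,
      facet:{categories:{type:terms, field:category, limit:-1}}}}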

Sorting: not so sure, but I wonder if you can inject a synthetic field
or function to do 'max/min' on the date even under the category (as a
3rd level nesting)?

Also, another approach that came to me afterwards was to use
relatedness: 
https://lucene.apache.org/solr/guide/8_8/json-facet-api.html#relatedness-and-semantic-knowledge-graphs
and try to identify categories with foreground being recent-date-range
and the background being everything (or relevant subset of such). This
may be cloud-only and totally may not work. I had some education
material related to this at slides 19+ of
https://www.slideshare.net/arafalov/searching-for-ai-leveraging-solr-for-classic-artificial-intelligence-tasks

Regards,
   Alex.

On Fri, 5 Feb 2021 at 14:31, Hullegård, Jimi
 wrote:
>
> Ah, I never thought about grouping on date ranges, and nesting the faceting 
> like that. Interesting! I managed to do a quick test query that seems to give 
> me what I want:
>
> {
>   "query": "*:*",
>   "filter": "+category:* +modified:[NOW/DAY-60DAYS TO *]",
>   "limit": 0,
>   "facet": {
>     "ranges": {
>       "type": "range",
>       "field": "modified",
>       "start": "NOW/DAY-60DAYS",
>       "end": "NOW/DAY",
>       "gap": "+7DAY",
>       "sort": "index",
>       "facet": {
>         "categories": {
>           "type": "terms",
>           "field": "category",
>           "limit": -1
>         }
>       }
>     }
>   }
> }
>
>
> But... this query, as well as all the examples, are in json query format, in 
> a request body. The actual query will be sent using a custom API that only 
> accepts a regular URL query, with parameters. Any idea how I can rewrite the 
> json query above into a URL query?
>
> Also, it would be easier to use the result if the ranges were sorted by the 
> date in descending order, but no matter what I tried I couldn't get it to 
> work. I thought that "sort": "index desc" should do the trick, but it seems 
> that index sort can't be reversed?
>
>
> Alexandre Rafalovitch wrote:
> >
> > This feels like basic faceting on category, but you are trying to make a 
> > latest record, rather than count as a sorting/grouping principle.
> >
> > How about using JSON Facets?
> >
> > https://lucene.apache.org/solr/guide/8_8/json-facet-api.html
> >
> > I would do the first level as range facet and do your dates at whatever 
> > granularity (say 1 week?) and then second level category terms.
> >
> > Let us know if it works. It is an interesting question.
> >


Re: Extract a list of the most recent field values?

2021-02-05 Thread Alexandre Rafalovitch
This feels like basic faceting on category, but you are trying to make
a latest record, rather than count as a sorting/grouping principle.

How about using JSON Facets?
https://lucene.apache.org/solr/guide/8_8/json-facet-api.html
I would do the first level as range facet and do your dates at
whatever granularity (say 1 week?) and then second level category
terms.

Let us know if it works. It is an interesting question.

Regards,
   Alex.

On Fri, 5 Feb 2021 at 08:18, Hullegård, Jimi
 wrote:
>
> Hi,
>
> Say we have a bunch of documents in Solr, and each document has a multi value 
> field "category". Now I would like to get the N most recently used 
> categories, ordered so that the most recently used category comes first and 
> then in falling order.
>
> My simplistic solution to this would be:
>
> 1. Perform a search for all documents with at least one category set, sorted 
> by last modified date
> 2. Iterate over the search results, extracting the categories used, and add 
> them to a list
> 3. If we have N number of categories, stop iterating the results
> 4. If there isn't enough categories, go to the next page in the search 
> results and start over at step 2 above, until N categories are found or no 
> more search results
>
> But this doesn't seem very efficient. Especially if there are lots of 
> documents, and N is a high number and/or only a handful of categories are 
> used most of the time, since it could mean one has to look through a whole 
> lot of documents before having enough categories. Worst case scenario: N is 
> higher than the total number of unique categories used, in which case one 
> would iterate over every single document that has a category.
>
> Is there a way one can construct some special query to solr to get this data 
> in a more efficient way?
>
> Regards
> /Jimi
>


Re: 404 Errors on update/extract

2021-02-05 Thread Alexandre Rafalovitch
Hi Leon,

Feel free to create a JIRA issue
https://issues.apache.org/jira/secure/Dashboard.jspa
and then do a GitHub pull request to fix the example name. The
documentation is in asciidoc format at:
https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide/src
with names matching those on the server.

This could be a great issue to cut your teeth on with helping Solr :-)

Regards,
   Alex.

On Fri, 5 Feb 2021 at 10:35, nq  wrote:
>
> Hi Alex,
>
>
> Thanks a lot for your help!
>
> I have tested the same using the 'techproducts' example as proposed, and
> it worked fine.
>
>
> You are right, the documentation seems to be outdated in this aspect.
>
> I have just reviewed the solrconfig.xml of the 'schemaless' example and
> found all the Solr Cell config was completely missing.
>
> After adding it as described at
>
> https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-extractingrequesthandler-in-solrconfig-xml
>
> everything worked fine again.
>
>
> What can I do to help updating the docs?
>
>
> Best regards,
>
> Leon
>
>
> Am 05.02.21 um 16:15 schrieb Alexandre Rafalovitch:
> > I think the extract handler is not defined in schemaless. This may be
> > a change from before and the documentation is out of sync.
> >
> > Can you try 'techproducts' example instead of schemaless:
> > bin/solr stop (if you are still running it)
> > bin/solr start -e techproducts
> >
> > Then the import command.
> >
> > The Tika integration is defined in solrconfig.xml and needs both
> > handler defined and some libraries loaded. Once you confirmed you like
> > what you see, you can copy those into whatever configuration you are
> > working with.
> >
> > Regards,
> > Alex.
> >
> > On Fri, 5 Feb 2021 at 07:38, nq  wrote:
> >> Hi,
> >>
> >>
> >> I am new to Solr and tried to follow the guide to upload PDF data using
> >> Tika, on Solr 8.7.0 (running on Debian 10):
> >>
> >> https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html
> >>
> >> but I get an HTTP 404 error when trying to import the file.
> >>
> >>
> >> In the solr installation directory, after spinning up the example server
> >> using
> >>
> >> solr/bin/solr -e schemaless
> >>
> >> I firstly used the Post Tool to index a PDF file as described in the
> >> guide, giving the following output (paths truncated using “[…]” for
> >> privacy reasons):
> >>
> >> bin/post -c gettingstarted example/exampledocs/solr-word.pdf -params
> >> "literal.id=doc1"
> >>
> >>> java -classpath /[…]/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes
> >>> -Dparams=literal.id=doc1 -Dc=gettingstarted -Ddata=files org.apa
> >>> che.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
> >>> SimplePostTool version 5.0.0
> >>> Posting files to [base] url
> >>> http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
> >>> Entering auto mode. File endings considered are
> >>> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> >>> POSTing file solr-word.pdf (application/pdf) to [base]/extract
> >>> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
> >>> url:
> >>> http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1
> >>> esource.name=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
> >>> SimplePostTool: WARNING: Response:
> >>> Error 404 Not Found
> >>> HTTP ERROR 404 Not Found
> >>> URI: /solr/gettingstarted/update/extract
> >>> STATUS: 404
> >>> MESSAGE: Not Found
> >>> SERVLET: default
> >>> SimplePostTool: WARNING: IOException while reading response:
> >>> java.io.FileNotFoundException:
> >>> http://localhost:8983/solr/gettingstarted/update/extract
> >>> ?literal.id=doc1=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
> >>>
> >>> 1 files indexed.
> >>> COMMITting Solr index changes to
> >>> http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
> >>> Time spent: 0:00:00.038
> >> resulting in no actual changes being visible in the Solr.
> >>
> >>
> >> Using curl results in the same HTTP response:
> >>
> >>> curl
> >>> 'http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1=true'
> >>> -F "myfile=@example
> >>> /exampledocs/solr-word.pdf"
> >>> Error 404 Not Found
> >>> HTTP ERROR 404 Not Found
> >>> URI: /solr/gettingstarted/update/extract
> >>> STATUS: 404
> >>> MESSAGE: Not Found
> >>> SERVLET: default
> >>>
> >> Sorry if this has already been discussed somewhere; I have not been able
> >> to find anything helpful yet.
> >>
> >> Thank you!
> >>
> >> Leon
> >>


Re: 404 Errors on update/extract

2021-02-05 Thread Alexandre Rafalovitch
I think the extract handler is not defined in schemaless. This may be
a change from before and the documentation is out of sync.

Can you try 'techproducts' example instead of schemaless:
bin/solr stop (if you are still running it)
bin/solr start -e techproducts

Then the import command.

The Tika integration is defined in solrconfig.xml and needs both
handler defined and some libraries loaded. Once you confirmed you like
what you see, you can copy those into whatever configuration you are
working with.

Regards,
   Alex.

On Fri, 5 Feb 2021 at 07:38, nq  wrote:
>
> Hi,
>
>
> I am new to Solr and tried to follow the guide to upload PDF data using
> Tika, on Solr 8.7.0 (running on Debian 10):
>
> https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html
>
> but I get an HTTP 404 error when trying to import the file.
>
>
> In the solr installation directory, after spinning up the example server
> using
>
> solr/bin/solr -e schemaless
>
> I firstly used the Post Tool to index a PDF file as described in the
> guide, giving the following output (paths truncated using “[…]” for
> privacy reasons):
>
> bin/post -c gettingstarted example/exampledocs/solr-word.pdf -params
> "literal.id=doc1"
>
> > java -classpath /[…]/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes
> > -Dparams=literal.id=doc1 -Dc=gettingstarted -Ddata=files org.apa
> > che.solr.util.SimplePostTool example/exampledocs/solr-word.pdf
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
> > Entering auto mode. File endings considered are
> > xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file solr-word.pdf (application/pdf) to [base]/extract
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
> > url:
> > http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1
> > esource.name=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
> > SimplePostTool: WARNING: Response:
> > Error 404 Not Found
> > HTTP ERROR 404 Not Found
> > URI: /solr/gettingstarted/update/extract
> > STATUS: 404
> > MESSAGE: Not Found
> > SERVLET: default
> > SimplePostTool: WARNING: IOException while reading response:
> > java.io.FileNotFoundException:
> > http://localhost:8983/solr/gettingstarted/update/extract
> > ?literal.id=doc1=%2F[…]%2Fsolr-8.7.0%2Fexample%2Fexampledocs%2Fsolr-word.pdf
> >
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update?literal.id=doc1...
> > Time spent: 0:00:00.038
> resulting in no actual changes being visible in the Solr.
>
>
> Using curl results in the same HTTP response:
>
> > curl
> > 'http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc1=true'
> > -F "myfile=@example
> > /exampledocs/solr-word.pdf"
> > Error 404 Not Found
> > HTTP ERROR 404 Not Found
> > URI: /solr/gettingstarted/update/extract
> > STATUS: 404
> > MESSAGE: Not Found
> > SERVLET: default
> >
>
> Sorry if this has already been discussed somewhere; I have not been able
> to find anything helpful yet.
>
> Thank you!
>
> Leon
>


Re: How to get case-sensitive Terms?

2021-02-03 Thread Alexandre Rafalovitch
What about copyField with the target being index only (docValue only?) and
no lowercase on the target field type?

Solr is not a database, you are optimising for search. So duplicate,
multi-process, denormalise, create custom field types, etc.
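
A sketch of the copyField idea (field and type names are hypothetical, and
reindexing is required after the schema change):

  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="title_exact" type="string" indexed="true" stored="false" docValues="true"/>
  <copyField source="title" dest="title_exact"/>

Searches keep running against the lowercased title field, so "John" and
"john" still match each other, while the Terms requests go against
title_exact, which preserves the original case.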

Regards,
   Alex

On Wed., Feb. 3, 2021, 4:43 p.m. elivis,  wrote:

> Alexandre Rafalovitch wrote
> > It is documented in the reference guide:
> > https://lucene.apache.org/solr/guide/8_8/analysis-screen.html
> >
> > Hope it helps,
> >Alex.
> >
> > On Tue, 2 Feb 2021 at 00:57, elivis 
>
> > elivis@
>
> >  wrote:
> >>
> >> Alexandre Rafalovitch wrote
> >> > Admin UI also allows you to run text string against a field definition
> >> to
> >> > see what each stage of analyzer chain does.
> >>
> >> Thank you. Could you please let me know how to do this (see what each stage
> >> of
> >> analyzer chain does)?
> >>
> >>
> >>
> >>
> >> --
> >> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
> Thank you, Alex! We were indeed using the LowerCaseFilterFactory on the
> text
> field that I'm using, and if I remove it from the schema, I do indeed get
> case sensitive terms. However, I don't think I can just remove the
> LowerCaseFilterFactory and call it a day. The reason we are using it is
> because we want our "exact match" searches to NOT be case sensitive - a
> search for "John" should return hits for "John" or "john". Is there a way
> to
> achieve this result in an efficient manner, if I remove the
> LowerCaseFilterFactory?
>
> Thank you again.
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: How to get case-sensitive Terms?

2021-02-02 Thread Alexandre Rafalovitch
It is documented in the reference guide:
https://lucene.apache.org/solr/guide/8_8/analysis-screen.html

Hope it helps,
   Alex.

On Tue, 2 Feb 2021 at 00:57, elivis  wrote:
>
> Alexandre Rafalovitch wrote
> > Admin UI also allows you to run text string against a field definition to
> > see what each stage of analyzer chain does.
>
> Thank you. Could you please let me know how to do this (see what each stage of
> the analyzer chain does)?
>
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Apache Solr Reference Guide isn't accessible

2021-02-01 Thread Alexandre Rafalovitch
And if you need something more recent while this is being fixed, you
can look right at the source on GitHub, though navigation, etc. is
missing:
https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/analyzers.adoc

Open Source :-)

Regards,
   Alex.

On Mon, 1 Feb 2021 at 15:04, Mike Drob  wrote:
>
> Hi Dorion,
>
> We are currently working with our infra team to get these restored. In the
> meantime, the 8.4 guide is still available at
> https://lucene.apache.org/solr/guide/8_4/ and are hopeful that the 8.8
> guide will be back up soon. Thank you for your patience.
>
> Mike
>
> On Mon, Feb 1, 2021 at 1:58 PM Dorion Caroline 
> wrote:
>
> > Hi,
> >
> > I can't access to Apache Solr Reference Guide since few days.
> > Example:
> > URL
> >
> >   *   https://lucene.apache.org/solr/guide/8_8/
> >   *   https://lucene.apache.org/solr/guide/8_7/
> > Result:
> > Not Found
> > The requested URL was not found on this server.
> >
> > Do you know what going on?
> >
> > Thanks
> > Caroline Dorion
> >


Re: How to get case-sensitive Terms?

2021-01-30 Thread Alexandre Rafalovitch
Check the field type and associated indexing chain in managed-schema of
your core. It probably has the lowercase filter in it.

Find a better type or make one yourself. Remember to reload the schema and
reindex the content.

Admin UI also allows you to run text string against a field definition to
see what each stage of analyzer chain does.

Regards,
Alex

On Sat., Jan. 30, 2021, 12:59 p.m. elivis,  wrote:

> I'm using Terms Component functionality
> (https://lucene.apache.org/solr/guide/8_4/the-terms-component.html) to get
> all terms from an index. However, I need the terms to be in the original
> case lettering (e.g. "TeSt"). So far I am only able to get lowercased terms
> (i.e. "test" instead of "TeSt").
>
> Can somebody please let me know if this is possible, and if so, how to do
> this?
>
> Thank you!
>
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Multi-select faceting for nested documents

2021-01-25 Thread Alexandre Rafalovitch
I don't have an answer, but I feel that maybe explaining the situation
in more details would help a bit more. Specifically, you explain your
data structure well, but not your actual presentation requirement in
enough details.

How would you like the multi-select to work, how it is working for you
now and what is the gap?

Regards,
   Alex.
P.s. Sometimes you really have to modify the way the information is
stored in Solr to get efficient and effective search results. Solr is
not a database, so it needs to model the search requirements rather
than the original data shape.

On Mon, 25 Jan 2021 at 10:34, Lance Snell  wrote:
>
> Any examples would be greatly appreciated.
>
> On Mon, Jan 25, 2021, 2:25 AM Lance Snell  wrote:
>
> > Hey all,
> >
> > I am having trouble finding current examples of multi-select faceting for
> > nested documents.  Specifically ones with *multiple *levels of nested
> > documents.
> >
> > My current schema has a parent document, two child documents(siblings),
> > and a grandchild document.  I am using the JSON API.
> >
> > Product -> Sku -> Price
> >    |
> >    v
> > StoreCategory
> >
> > Any help/direction would be appreciated.
> >
> >
> > Solr. 8.6
> >
> > --
> > Thanks,
> >
> > Lance
> >


Re: Exact matching without using new fields

2021-01-21 Thread Alexandre Rafalovitch
If, during index time, your "information" and "informed" are tokenized
into the same root (inform?), then you will not be able to distinguish
them without storing the original forms somewhere, usually with copyField.
Same with information vs INFORMATION. The search happens against the
indexed tokens, which you can check in the Admin UI by seeing how your text
is indexed.

But if you do store the form you want to find, you have several
options with eDisMax pf2/pf3, or with Surround Query Parser.
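
As an illustration of those query-time levers (a hedged sketch; field names
and boosts are hypothetical and would need tuning):

  q="information retrieval"&defType=edismax
    &qf=title description
    &pf=title^10&pf2=title^4&pf3=title^2

pf/pf2/pf3 boost documents where the query terms appear adjacently (full
phrase, bigrams, trigrams), so near-exact matches float to the top while the
looser matches stay in the results, which fits the "ok with these documents
showing up as long as they show up at bottom" requirement below.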

Regards,
   Alex.

On Tue, 19 Jan 2021 at 15:02, gnandre  wrote:
>
> Thanks for replying, Dave.
>
> I am afraid that I am looking for non-index time i.e. query time solution.
>
> Actually in my case I am expecting both documents to be returned from your
> example. I am just trying to avoid returning of documents which contain a
> tokenized versions
> of the provided search query when it is enclosed within double quotes to
> indicate exact matching expectation.
>
> e.g.
> search query -> "information retrieval"
>
> This should match documents like following:
> doc 1: "information retrieval"
> doc 2: "Advanced information retrieval with Solr"
>
> but should NOT match documents like
> doc 3: "informed retrieval"
> doc 4: "information extraction"  (considering 'extraction' was a specified
> synonym of 'retrieval' )
> doc 5: "INFORMATION RETRIEVAL"
>
> etc
>
> I am also ok with these documents showing up as long as they show up at
> bottom. Also, query time solution is a must.
>
> On Tue, Jan 19, 2021 at 12:22 PM David R  wrote:
>
> > We had the same requirement. Just to echo back your requirements, I
> > understand your case to be this. Given these 2 doc titles:
> >
> > doc 1: "information retrieval"
> > doc 2: "Advanced information retrieval with Solr"
> >
> > You want a phrase search for "information retrieval" to find both
> > documents, but an EXACT phrase search for "information retrieval" to find
> > doc #1 only.
> >
> > If that's true, and case-sensitive search isn't a requirement, I indexed
> > this in the token stream, with adjacent positions of course.
> >
> > START information retrieval END
> > START advanced information retrieval with solr END
> >
> > And with our custom query parser, when an EXACT operator is found, I
> > tokenize the query to match the first case. Otherwise pass it through.
> >
> > Needs custom analyzers on the query and index sides to generate the
> > correct token sequences.
> >
> > It's worked out well for our case.
> >
> > Dave
> >
> >
> >
> > 
> > From: gnandre 
> > Sent: Tuesday, January 19, 2021 4:07 PM
> > To: solr-user@lucene.apache.org 
> > Subject: Exact matching without using new fields
> >
> > Hi,
> >
> > I am aware that to do exact matching (only whatever is provided inside
> > double quotes should be matched) in Solr, we can copy existing fields with
> > the help of copyFields into new fields that have very minimal tokenization
> > or no tokenization (e.g. using KeywordTokenizer or using string field type)
> >
> > However this solution is expensive in terms of index size because it might
> > almost double the size of the existing index.
> >
> > Is there any inexpensive way of achieving exact matches from the query
> > side. e.g. boost the original tokens more at query time compared to their
> > tokens?
> >


Re: [Solr8.7] Chinese ZH language ?

2021-01-10 Thread Alexandre Rafalovitch
>possible analysis error: cannot change field "tizh" from

You have content indexed against the old, incompatible definition. Deleted
but not yet purged records still count.

Delete your index data, or change the field name during testing.
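
A quick way to wipe the data during testing (a sketch; 'yyy' stands in for
your core name, and a full reindex is needed afterwards):

curl 'http://localhost:8983/solr/yyy/update?commit=true' \
  -H 'Content-Type: application/xml' \
  --data-binary '<delete><query>*:*</query></delete>'

The old segments only actually disappear on merge, so if the error persists,
stopping Solr and removing the core's data directory is the surest reset.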

Regards,
Alex
On Sun., Jan. 10, 2021, 9:19 a.m. Bruno Mannina,  wrote:

> Hello,
>
>
>
> I would like to index simplified chinese ZH language (i.e. 一种新型太阳能坪
> 床增温系统),
>
> I added in my solrconfig the lib:
>
> <lib dir="${solr.install.dir:../../..}/contrib/analysis-extras/lucene-libs/"
> regex="lucene-analyzers-smartcn-8\.7\.0\.jar" />
>
>
>
> First question: Is it enough ?
>
>
>
> But now I need your help to define the fieldtype “text_zh” in my
> schema.xml to use with:
>
> (PS: As with other fields, I need highlighting)
>
>
>
> <field name="tizh" type="text_zh" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
> And
>
>
>
> <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>   </analyzer>
> </fieldType>
>
>
>
> No error, when I reload my core.
>
>
>
> But I can’t index Chinese data, I get this error:
>
>
>
> POSTing file CN-0005.xml (application/xml) to [base]
>
> SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url:
> http:///solr/yyy/update
>
> SimplePostTool: WARNING: Response:
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
>   <int name="status">400</int>
>   <int name="QTime">1</int>
> </lst>
> <lst name="error">
>   <lst name="metadata">
>     <str name="error-class">org.apache.solr.common.SolrException</str>
>     <str name="root-error-class">java.lang.IllegalArgumentException</str>
>   </lst>
>   <str name="msg">Exception writing document id CN112091782A to the index;
> possible analysis error: cannot change field "tizh" from index options=DOCS
> to inconsistent index options=DOCS_AND_FREQS_AND_POSITIONS</str>
>   <int name="code">400</int>
> </lst>
> </response>
>
> SimplePostTool: WARNING: IOException while reading response:
> java.io.IOException: Server returned HTTP response code: 400 for URL:
> http:///solr/yyy/update
>
>
>
> Thanks a lot for your help,
>
> Bruno
>
>
>
>
>
> --
> L'absence de virus dans ce courrier électronique a été vérifiée par le
> logiciel antivirus Avast.
> https://www.avast.com/antivirus
>


Re: DIH and UUIDProcessorFactory

2020-12-17 Thread Alexandre Rafalovitch
Try with the explicit URP chain too. It may work as well.

Regards,
   Alex.

On Thu, 17 Dec 2020 at 16:51, Dmitri Maziuk  wrote:
>
> On 12/12/2020 4:36 PM, Shawn Heisey wrote:
> > On 12/12/2020 2:30 PM, Dmitri Maziuk wrote:
> >> Right, ```Every update request received by Solr is run through a chain
> >> of plugins known as Update Request Processors, or URPs.```
> >>
> >> The part I'm missing is whether DIH's ' >> name="/dataimport"' counts as an "Update Request", my reading is it
> >> doesn't and URP chain applies only to ' >
> > If you define an update chain as default, then it will be used for all
> > updates made where a different chain is not specifically requested.
> >
> > I have used this personally to have my custom update chain apply even
> > when the indexing comes from DIH.  I know for sure that this works on
> > 4.x and 5.x versions; it should work on newer versions as well.
> >
>
> Confirmed w/ 8.7.0: I finally got to importing the one DB where I need
> this, and UUIDs are there with the default URP chain.
>
> Thank you
> Dima
>
>


Re: DIH and UUIDProcessorFactory

2020-12-12 Thread Alexandre Rafalovitch
Why not? You should be able to put an URP chain after DIH, the usual way.

Is there something about UUID that is special?
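
For reference, a minimal sketch of the usual way in solrconfig.xml (the chain
name is arbitrary, and the target field here is just an example):

<updateRequestProcessorChain name="add-uuid" default="true">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With default="true" it should apply to any update that does not request a
different chain, which is what the 2020-12-17 follow-up in this thread
confirmed for DIH.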

Regards,
Alex

On Sat., Dec. 12, 2020, 2:55 p.m. Dmitri Maziuk, 
wrote:

> Hi everyone,
>
> is there an easy way to use the stock UUID generator with DIH? We have a
> hand-written one-liner class we use as DIH entity transformer but I
> wonder if there's a way to use the built-in UUID generator class instead.
>
>  From the TFM it looks like there isn't, is that correct?
>
> TIA,
> Dmitri
>


Re: is there a way to trigger a notification when a document is deleted in solr

2020-12-07 Thread Alexandre Rafalovitch
Maybe a postCommit listener?
https://lucene.apache.org/solr/guide/8_4/updatehandlers-in-solrconfig.html
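
A minimal sketch of that wiring (the listener class is hypothetical; you would
implement org.apache.solr.core.SolrEventListener yourself):

<updateHandler class="solr.DirectUpdateHandler2">
  <listener event="postCommit" class="com.example.DeletionNotifyListener"/>
</updateHandler>

Note it fires on every commit, not only on deletes, so the listener still
needs its own logic to work out what was removed.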

Regards,
   Alex.

On Mon, 7 Dec 2020 at 08:03, Pushkar Mishra  wrote:
>
> Hi All,
>
> Is there a way to trigger a notification when a document is deleted in
> solr? Or may be when auto purge gets complete of deleted documents in solr?
>
> Thanks
>
> --
> Pushkar Kumar Mishra
> "Reactions are always instinctive whereas responses are always well thought
> of... So start responding rather than reacting in life"


Re: chaining charFilter

2020-12-02 Thread Alexandre Rafalovitch
Did you reload the core for it to notice the new schema? Or try creating a
new core from the same schema?

If it is a SolrCloud, you also have to upload the schema to the Zookeeper.
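
For a standalone core, the reload can be done from the Admin UI or with the
Core Admin API, something like (core name hypothetical):

curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycore'

For SolrCloud it would be the Collections API RELOAD action after pushing the
updated configset to ZooKeeper.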

Regards,
   Alex.

On Wed, 2 Dec 2020 at 09:19, Arturas Mazeika  wrote:

> Hi Solr-Team,
>
> The manual of charfilters says that one can chain them: (from
> https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.MappingCharFilterFactory
> ):
>
> CharFilters can be chained like Token Filters and placed in front of a
> Tokenizer. CharFilters can add, change, or remove characters while
> preserving the original character offsets to support features like
> highlighting.
>
> I am trying to filter out some of the chars from some fields, so I can do
> efficient and effective faceting later. I tried to chain charfilters
> for that purpose:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.*[/\\])([^/\\]+)$"   replacement="$2"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9\-]+)T([0-9\-]+)" replacement="$1 $2"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z]+" replacement=" "/>
>     <tokenizer class="..."/>
>   </analyzer>
> </fieldType>
>
> <field name="..." type="..." stored="true"/>
>
> but in the schema definition I see only the last charfilter
> [image: image.png]
>
> Any clues why?
>
> Cheers,
> Arturas
>


Re: Trouble with post.jar

2020-11-05 Thread Alexandre Rafalovitch
Are you sure you have the request handler for /update/extract defined
in your solrconfig.xml?
Not all the update request handlers are defined explicitly (you can
check with Config API - /solr/hadoopDocs/config/requestHandler), but I
am 99% sure that the /update/extract would be explicit because it
needs Tika, which means a library statement to load the jar as well.

The latest Solr does not have this handler in the default
configuration, so if you bootstrapped from that, this is the most
likely cause. But the non-default techproducts one does. So, you could
copy the lib directive (contrib/extraction) and the request handler
(/update/extract) to your config's solrconfig.xml and - after
restarting the core - it may work.

Regards,
   Alex.
P.s. The relevant solrconfig.xml is in
solr-8.6.1/server/solr/configsets/sample_techproducts_configs/conf ,
but make sure to not modify things anywhere in that path, just copy
from it.

On Thu, 5 Nov 2020 at 11:40, Bruce Campbell
 wrote:
>
> Thanks for your reply. I am using the Solr (or lucene) web site as a test 
> site so my collection name is "solr". I think the first solr is part of the 
> part of the url that the solr application uses while the second one is the 
> name of the collection. Here is the same message when I tried to use a 
> collection called hadoopDocs:
>
> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: 
> http://localhost:8983/solr/hadoopDocs/update/extract?commit=true
>
> If I am wrong, please correct me.
>
> Thanks again for your reply,
> Bruce
> -Original Message-
> From: Vincenzo D'Amore 
> Sent: Thursday, November 5, 2020 9:42 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Trouble with post.jar
>
> I see there are two solr in the url path, very likely you typed the wrong 
> Solr host parameter
>
> http://localhost:8983/solr/solr/update/extract?commit=true
>
> Ciao,
> Vincenzo
>
> --
> mobile: 3498513251
> skype: free.dev
>
> > On 5 Nov 2020, at 16:27, Bruce Campbell  
> > wrote:
> >
> > http://localhost:8983/solr/solr/update/extract?commit=true


Re: Possible to add a default "appends" fq except for queries in the admin GUI?

2020-10-22 Thread Alexandre Rafalovitch
Why not have a custom handler endpoint for your online queries? You
will be modifying them anyway to remove fq.

Or even create individual endpoints for every significant use-case.
You can share the configuration between them with initParams or
useParams, but have more flexibility going forward.

Admin UI allows you to change /select, but - like you said - manually
and every time.
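
As a sketch of the first option (the handler name and the marker field are
hypothetical):

<requestHandler name="/site-select" class="solr.SearchHandler">
  <lst name="appends">
    <str name="fq">-hidden_b:true</str>
  </lst>
</requestHandler>

The website components query /site-select and get the filter for free, while
/select stays unfiltered for administrative use.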

Regards,
  Alex.

On Thu, 22 Oct 2020 at 14:18, Batanun B  wrote:
>
> Hi,
>
> We have multiple components that uses the Solr search feature on our 
> websites. But we have some documents in the index that we never want to 
> display in the search results (nothing secret or anything, just uninteresting 
> for the user to see). So far, we have added a fq to all our queries, that 
> filters out these documents. But we would like to not have to do this, since 
> there is always a risk of us forgetting to add that fq parameter.
>
> So, today i tried adding this fq in a "appends" list in the standard 
> requestHandler. I needed to add it to the standard one, since that's the one 
> that all the search components use (ie, no qt parameter defined, and i would 
> prefer not to have to change that). That worked fine. Until I needed to do a 
> query in the solr admin GUI, and realized that this filter query was used 
> there too, effectively hiding a bunch of documents that I as an administrator 
> need to see.
>
> Is there a way to avoid this problem? Can I somehow configure Solr to not use 
> this filter query when in the admin GUI? If I define a separate request 
> handler in solrconfig, can i make the admin GUI always use this by default? I 
> don't want to have to manually change the request handler in the admin GUI 
> every time.
>
> What I tried so far:
>
> * Adding the fq in the "appends" in the standard request handler, as 
> mentioned above. Causing the filter to always be in effect, even in admin GUI
> * Keeping the configuration as above, but also adding a request handler with 
> name="/select", that doesn't have this fq defined. Then the filter was never 
> applied, not in admin GUI and not on any website search


Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread Alexandre Rafalovitch
I think this is all explained quite well in the Ref Guide:
https://lucene.apache.org/solr/guide/8_6/docvalues.html

DocValues is a different way to index/store values. Faceting is a
primary use case where docValues are better than what 'indexed=true'
gives you.

Regards,
   Alex.

On Mon, 19 Oct 2020 at 12:51, uyilmaz  wrote:
>
>
> Hey all,
>
> From my little experiments, I see that (if I didn't make a stupid mistake) we 
> can facet on fields marked as both indexed and stored being false:
>
> <field name="..." type="..." indexed="false" stored="false" docValues="true"/>
>
> I'm surprised by this, I thought I would need to index it. Can you confirm 
> this?
>
> Regards
>
> --
> uyilmaz 


Re: converting string to solr.TextField

2020-10-16 Thread Alexandre Rafalovitch
Just as a side note,

> indexed="true"
If you are storing a 32K message, you probably are not searching it as a
whole string. So, don't index it. You may also want to mark the field
as 'large' (and lazy):
https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
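
A hedged sketch of that direction (field and type names hypothetical; per the
linked page, large="true" requires stored="true" and a single-valued field):

<fieldType name="text_stored_only" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="body" type="text_stored_only" indexed="false" stored="true" large="true"/>

With indexed="false" the analyzer is mostly moot; the field just carries the
stored value past the 32K indexed-term limit.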

When you make it a text field, you will probably hit the same issues as well.

And honestly, if you are not storing those fields to search, maybe you
need to consider the architecture. Maybe those fields do not need to
be in Solr at all, but in external systems. Solr (or any search
system) should not be your system of records since - as the other
reply showed - some of the answers are "reindex everything".

Regards,
   Alex.

On Fri, 16 Oct 2020 at 14:02, yaswanth kumar  wrote:
>
> I am using solr 8.2
>
> Can I change the schema fieldtype from string to solr.TextField
> without reindexing?
>
> 
>
> The reason is that string has only 32K char limit where as I am looking to
> store more than 32K now.
>
> The contents of this field don't require any analysis or tokenization, but I
> need this field in queries as well as in output fields.
>
> --
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com


Re: Solr 8.6.3

2020-10-15 Thread Alexandre Rafalovitch
Why not do an XSLT transformation on it before it hits Solr?

Or during indexing, if it really has to be in-Solr for some reason:
https://lucene.apache.org/solr/guide/8_6/uploading-data-with-index-handlers.html#using-xslt-to-transform-xml-index-updates

But you have more options outside, as you could use XQuery instead.

As long as final XML is in Solr format, you are good to go.
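
For the in-Solr route, the linked page boils down to putting a stylesheet in
the configset's conf/xslt directory and naming it via the tr parameter,
roughly (collection and file names hypothetical):

curl 'http://localhost:8983/solr/mycollection/update?commit=true&tr=my-transform.xsl' \
  -H 'Content-Type: text/xml' --data-binary @custom-data.xml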

Regards,
Alex

On Thu., Oct. 15, 2020, 4:13 p.m. Kris Gurusamy, <
krishnan.gurus...@xpanse.com> wrote:

> I've just downloaded solr 8.6.3 and trying to create DIH for loading
> structured XML. I found out that DIH will be deprecated soon with version
> 9.0. What is the equivalent of DIH in new solr version? How do I import
> structured XML data which is very custom and index in Solr new version? Any
> help is appreciated.
>
> Regards
>
> Kris Gurusamy
> Director, Engineering
> kgurus...@xpanse.com
> www.xpanse.com
>
> On 10/15/20, 1:08 PM, "Anshum Gupta (Jira)"  wrote:
>
>
>  [
> https://issues.apache.org/jira/browse/SOLR-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Anshum Gupta resolved SOLR-14938.
> -
> Resolution: Invalid
>
> [~krisgurusamy] - Please ask questions regarding usage on the Solr
> user mailing list.
>
> JIRA is meant for issue tracking purposes.
>
> > Solr 8.6.3
> > --
> >
> > Key: SOLR-14938
> > URL:
> https://issues.apache.org/jira/browse/SOLR-14938
> > Project: Solr
> >  Issue Type: Bug
> >  Security Level: Public(Default Security Level. Issues are
> Public)
> >  Components: contrib - DataImportHandler
> >Reporter: Krishnan
> >Priority: Major
> >
> > I've just downloaded solr 8.6.3 and trying to create DIH for loading
> structured XML. I found out that DIH will be deprecated soon with version
> 9.0. What is the equivalent of DIH in new solr version? How do I import
> structured XML data which is very custom and index in Solr new version? Any
> help is appreciated.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>
>


Re: solr-8983.pid: Permission denied

2020-10-15 Thread Alexandre Rafalovitch
If the .pid file is not at that location, then I would investigate
where that file is instead (after Solr is started).

If it is in a different location, then you have different environment
expectations, somehow. This, in all honesty, would still be consistent
with my theory that Solr was started somehow differently (perhaps just
this once).

If it is nowhere, then you may have a permission issue around creating
that file in the first place.

Basically, I am saying that maybe the issue you have is a symptom of a
deeper discrepancy rather than the actual issue to solve directly.

Regards,
   Alex.

On Thu, 15 Oct 2020 at 11:03, Ryan W  wrote:
>
> The .pid file referenced in the "Permission denied" message does not exist.
>
> On Thu, Oct 15, 2020 at 11:01 AM Ryan W  wrote:
>
> > I have been starting solr like so...
> >
> > service solr start
> >
> >
> > On Thu, Oct 15, 2020 at 10:31 AM Joe Doupnik  wrote:
> >
> >>  Alex has it right. In my environment I created user "solr" in group
> >> "users". Then I ensured that "solr:user" owns all of Solr's files. In
> >> addition, I do Solr start/stop with an /etc/init.d script (the Solr
> >> distribution has the basic one which we can embellish) in which there is
> >> control line RUNAS="solr". The RUNAS variable is used to properly start
> >> Solr.
> >>  Thanks,
> >>  Joe D.
> >>
> >> On 15/10/2020 15:02, Alexandre Rafalovitch wrote:
> >> > It sounds like maybe you have started the Solr in a different way than
> >> > you are restarting it. E.g. maybe you started it manually (bin/solr
> >> > start, probably as a root) but are trying to restart it via service
> >> > script. Who owned the .pid file? I am guessing 'root', while the
> >> > service script probably runs as a different (lower-permission) user.
> >> >
> >> > The practical effect of that assumption is that your environmental
> >> > variables were set differently and various things (e.g. logs) may not
> >> > be where you expect.
> >> >
> >> > The solution is to be consistent in using the service to
> >> > start/restart/stop your Solr.
> >> >
> >> > Regards,
> >> > Alex.
> >> >
> >> > On Thu, 15 Oct 2020 at 09:51, Ryan W  wrote:
> >> >> What is my permissions problem here:
> >> >>
> >> >> [root@faspbsy0002 bin]# service solr restart
> >> >> Sending stop command to Solr running on port 8983 ... waiting up to 180
> >> >> seconds to allow Jetty process 38947 to stop gracefully.
> >> >> /opt/solr/bin/solr: line 2125: /opt/solr/bin/solr-8983.pid: Permission
> >> >> denied
> >> >>
> >> >> What is the practical effect if Solr can't write this solr-8983.pid
> >> file?
> >> >> What user should own the contents of /opt/solr/bin ?
> >> >>
> >> >> Thanks
> >>
> >>


Re: Data Import Handler

2020-10-15 Thread Alexandre Rafalovitch
Solr now has a package manager, and DIH is one of the packages. This reflects
the fact that its development cycle is not locked to Solr's, and it reduces the
core download. Tika may be heading the same way, as running Tika inside the Solr
process could cause memory issues with complex PDFs.

In terms of other ways to pre-process and load data into Solr, there are
things like:
1) Apache Camel https://camel.apache.org/
2) Apache NiFi https://nifi.apache.org/

Other commercial solutions also exist, such as StreamSets:
3)
https://streamsets.com/documentation/datacollector/latest/help//datacollector/UserGuide/Destinations/Solr.html

And, of course, you can always roll your own with SolrJ.
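
If you do roll your own, a minimal SolrJ sketch looks something like this
(URL and field names are hypothetical; SolrJ 8.x API):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDoc {
  public static void main(String[] args) throws Exception {
    // Point the client at a core/collection; adjust the URL as needed
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("title_s", "Hello from SolrJ");
      client.add(doc);   // send the document (visible only after commit)
      client.commit();   // make it searchable
    }
  }
}

Your custom pre-processing (XML parsing, field mapping) then lives in plain
Java before the addField calls.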

Regards,
  Alex.



On Thu, 15 Oct 2020 at 10:08, DINSD | SPAutores
 wrote:

> Hi
>
> Based on this document, there are two ways to index documents on the Solr
> platform: https://lucidworks.com/post/indexing-with-solrj/
>
> Quote:
> "Two popular methods of indexing existing data are the Data Import Handler
> (DIH) and Tika (Solr Cell)/ExtractingRequestHandler"
>
> Now that DIH has been discontinued, only supported by a community package,
> are there any other options?
>
> Best regards
> *Rui Pimentel*
>


Re: solr-8983.pid: Permission denied

2020-10-15 Thread Alexandre Rafalovitch
It sounds like maybe you have started the Solr in a different way than
you are restarting it. E.g. maybe you started it manually (bin/solr
start, probably as a root) but are trying to restart it via service
script. Who owned the .pid file? I am guessing 'root', while the
service script probably runs as a different (lower-permission) user.

The practical effect of that is that your environment variables were
set differently and various things (e.g. logs) may not be where you
expect.

The solution is to be consistent in using the service to
start/restart/stop your Solr.

Regards,
   Alex.

On Thu, 15 Oct 2020 at 09:51, Ryan W  wrote:
>
> What is my permissions problem here:
>
> [root@faspbsy0002 bin]# service solr restart
> Sending stop command to Solr running on port 8983 ... waiting up to 180
> seconds to allow Jetty process 38947 to stop gracefully.
> /opt/solr/bin/solr: line 2125: /opt/solr/bin/solr-8983.pid: Permission
> denied
>
> What is the practical effect if Solr can't write this solr-8983.pid file?
> What user should own the contents of /opt/solr/bin ?
>
> Thanks


Re: Analytics for Solr logs

2020-10-13 Thread Alexandre Rafalovitch
The tool was introduced in Solr 8.5 and lives at bin/postlogs. It is
quite new.

Regards,
   Alex.

On Tue, 13 Oct 2020 at 12:39, Zisis T.  wrote:
>
> I've stumbled upon
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/logs.adoc
> which looks very interesting for getting insights into the Solr logs.
>
> I cannot find though postlogs command inside the Solr bin dir (there is post
> command though) nor a way to create the logs collection. I've looked into
> solr-8.4.1 and solr-7.5.0 but could not find anything.
>
> 1) Is this still supported?
> 2) Where can I find the logs collection configuration? How can I create it?
> 3) Is post the same command as postlogs?
>
> Thanks
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Folding Repeated Letters

2020-10-09 Thread Alexandre Rafalovitch
Are there that many of those words? Because even if you deal with
YESSS, there is still YAS!

Maybe you just have regexp synonyms? (ye+s+)
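
That PatternReplace idea is easy to prototype with a plain Java regex
before wiring it into a PatternReplaceFilterFactory (pattern="(.)\1+",
replacement="$1") - a quick sketch:

    String[] inputs = {"YES", "YESSS", "YYYEEESSS"};
    for (String s : inputs) {
        // "(.)\1+" = any character followed by one or more repeats of itself
        System.out.println(s + " -> " + s.replaceAll("(.)\\1+", "$1"));
    }

All three fold to "YES". Note this also folds legitimate doubles
("ball" -> "bal"), so the same filter must run at both index and query
time.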

Good luck,
   413x

On Thu., Oct. 8, 2020, 6:02 p.m. Mike Drob,  wrote:

> I'm looking for a way to transform words with repeated letters into the
> same token - does something like this exist out of the box? Do our stemmers
> support it?
>
> For example, say I would want all of these terms to return the same search
> results:
>
> YES
> YESSS
> YYYEEESSS
> YYEE[...]S
>
> I don't know how long a user would hold down the S key at the end to
> capture their level of excitement, and I don't want to manually define
> synonyms for every length.
>
> I'm pretty sure that I don't want PhoneticFilter here, maybe
> PatternReplace? Not a huge fan of how that one is configured, and I think
> I'd have to set up a bunch of patterns inline for it?
>
> Mike
>


Re: Solr endpoint on the public internet

2020-10-08 Thread Alexandre Rafalovitch
Could be fun red/blue team exercise. Just watch out for those
cryptominers that get in through Solr injection (among many other
unsecured methods) and are a real pain to remove.

Regards,
   Alex.
P.s. Don't ask me how I know :-(
P.p.s. Read-only docker container may still be a good layer of defence
on top of everything. Respawn it every hour, if needed.

On Thu, 8 Oct 2020 at 15:05, David Hastings  wrote:
>
> Welp. Never mind I refer back to point #1 this is a bad idea
>
> > On Oct 8, 2020, at 3:01 PM, Alexandre Rafalovitch  
> > wrote:
> >
> > The update handlers are now implicitly defined (3 or 4 of them). So,
> > it actually needs to be explicitly shadowed and overridden with other
> > Noop handler. And block Config API to avoid attackers creating new
> > handlers.
> >
> > Regards,
> >   Alex.
> >
> >> On Thu, 8 Oct 2020 at 14:54, David Hastings  wrote:
> >>
> >> Well that’s why I suggested deleting the update handler :)
> >>
> >>>> On Oct 8, 2020, at 2:52 PM, Walter Underwood  
> >>>> wrote:
> >>>
> >>> Let me know where it is and I’ll delete all the documents in your 
> >>> collection.
> >>> It is easy, just one HTTP request.
> >>>
> >>> https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>> On Oct 8, 2020, at 11:49 AM, Alexandre Rafalovitch  
> >>>> wrote:
> >>>>
> >>>> I think there were past discussions about people doing but they really
> >>>> really knew what they were doing from a security perspective, not just
> >>>> Solr one.
> >>>>
> >>>> You are increasing your risk factor a lot, so you need to think
> >>>> through this. What are you protecting and what are you exposing. Are
> >>>> you trying to protect the updates? You may be able to do it with - for
> >>>> example - read-only docker container, or with embedded Solr or/and
> >>>> with reverse proxy.
> >>>>
> >>>> Are you trying to protect some of the data from being read? Even harder.
> >>>>
> >>>> There are implicit handlers, admin handlers, 'qt' to select query
> >>>> parser, etc. Lots of things to think about.
> >>>>
> >>>> It just may not be worth it.
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>>
> >>>>
> >>>>> On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  
> >>>>> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> We're looking into the option of setting up search with Solr without an
> >>>>> intermediary application. This would mean our backend would index data 
> >>>>> into
> >>>>> Solr and we would have a public Solr endpoint on the internet that would
> >>>>> receive search requests directly.
> >>>>>
> >>>>> Since I couldn't find an existing solution similar to ours, I would 
> >>>>> like to
> >>>>> know whether it's possible to secure Solr in a way that allows anyone 
> >>>>> only
> >>>>> read-access only to collections and how to achieve that. Specifically
> >>>>> because of this part of the documentation
> >>>>> <https://lucene.apache.org/solr/guide/8_5/securing-solr.html>:
> >>>>>
> >>>>> *No Solr API, including the Admin UI, is designed to be exposed to
> >>>>> non-trusted parties. Tune your firewall so that only trusted computers 
> >>>>> and
> >>>>> people are allowed access. Because of this, the project will not regard
> >>>>> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
> >>>>> ask you to report such issues in JIRA.*
> >>>>> Is there a way we can restrict read-only access to Solr collections so 
> >>>>> as
> >>>>> to allow users to make search requests directly to it or should we 
> >>>>> always
> >>>>> keep our Solr instances completely private?
> >>>>>
> >>>>> Thanks in advance!
> >>>>>
> >>>>> Best regards,
> >>>>> Marco Godinho
> >>>


Re: Solr endpoint on the public internet

2020-10-08 Thread Alexandre Rafalovitch
The update handlers are now implicitly defined (3 or 4 of them). So,
it actually needs to be explicitly shadowed and overridden with other
Noop handler. And block Config API to avoid attackers creating new
handlers.

Regards,
   Alex.

On Thu, 8 Oct 2020 at 14:54, David Hastings  wrote:
>
> Well that’s why I suggested deleting the update handler :)
>
> > On Oct 8, 2020, at 2:52 PM, Walter Underwood  wrote:
> >
> > Let me know where it is and I’ll delete all the documents in your 
> > collection.
> > It is easy, just one HTTP request.
> >
> > https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Oct 8, 2020, at 11:49 AM, Alexandre Rafalovitch  
> >> wrote:
> >>
> >> I think there were past discussions about people doing but they really
> >> really knew what they were doing from a security perspective, not just
> >> Solr one.
> >>
> >> You are increasing your risk factor a lot, so you need to think
> >> through this. What are you protecting and what are you exposing. Are
> >> you trying to protect the updates? You may be able to do it with - for
> >> example - read-only docker container, or with embedded Solr or/and
> >> with reverse proxy.
> >>
> >> Are you trying to protect some of the data from being read? Even harder.
> >>
> >> There are implicit handlers, admin handlers, 'qt' to select query
> >> parser, etc. Lots of things to think about.
> >>
> >> It just may not be worth it.
> >>
> >> Regards,
> >>  Alex.
> >>
> >>
> >>> On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  
> >>> wrote:
> >>>
> >>> Hi!
> >>>
> >>> We're looking into the option of setting up search with Solr without an
> >>> intermediary application. This would mean our backend would index data 
> >>> into
> >>> Solr and we would have a public Solr endpoint on the internet that would
> >>> receive search requests directly.
> >>>
> >>> Since I couldn't find an existing solution similar to ours, I would like 
> >>> to
> >>> know whether it's possible to secure Solr in a way that allows anyone only
> >>> read-access only to collections and how to achieve that. Specifically
> >>> because of this part of the documentation
> >>> <https://lucene.apache.org/solr/guide/8_5/securing-solr.html>:
> >>>
> >>> *No Solr API, including the Admin UI, is designed to be exposed to
> >>> non-trusted parties. Tune your firewall so that only trusted computers and
> >>> people are allowed access. Because of this, the project will not regard
> >>> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
> >>> ask you to report such issues in JIRA.*
> >>> Is there a way we can restrict read-only access to Solr collections so as
> >>> to allow users to make search requests directly to it or should we always
> >>> keep our Solr instances completely private?
> >>>
> >>> Thanks in advance!
> >>>
> >>> Best regards,
> >>> Marco Godinho
> >


Re: Solr endpoint on the public internet

2020-10-08 Thread Alexandre Rafalovitch
I think there were past discussions about people doing but they really
really knew what they were doing from a security perspective, not just
Solr one.

You are increasing your risk factor a lot, so you need to think
through this. What are you protecting and what are you exposing. Are
you trying to protect the updates? You may be able to do it with - for
example - read-only docker container, or with embedded Solr or/and
with reverse proxy.

Are you trying to protect some of the data from being read? Even harder.

There are implicit handlers, admin handlers, 'qt' to select query
parser, etc. Lots of things to think about.

It just may not be worth it.

Regards,
   Alex.


On Thu, 8 Oct 2020 at 14:27, Marco Aurélio  wrote:
>
> Hi!
>
> We're looking into the option of setting up search with Solr without an
> intermediary application. This would mean our backend would index data into
> Solr and we would have a public Solr endpoint on the internet that would
> receive search requests directly.
>
> Since I couldn't find an existing solution similar to ours, I would like to
> know whether it's possible to secure Solr in a way that allows anyone only
> read-access only to collections and how to achieve that. Specifically
> because of this part of the documentation
> :
>
> *No Solr API, including the Admin UI, is designed to be exposed to
> non-trusted parties. Tune your firewall so that only trusted computers and
> people are allowed access. Because of this, the project will not regard
> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
> ask you to report such issues in JIRA.*
> Is there a way we can restrict read-only access to Solr collections so as
> to allow users to make search requests directly to it or should we always
> keep our Solr instances completely private?
>
> Thanks in advance!
>
> Best regards,
> Marco Godinho


Re: MappingCharFilterFactory weird behaviour

2020-10-05 Thread Alexandre Rafalovitch
How do you know it does not apply?

My "doh" moment is often forgetting that the stored version of the field is
not affected by analyzers. One has to look in the schema Admin UI to check
the indexed values.

Regards,
   Alex

On Mon., Oct. 5, 2020, 6:01 a.m. Lukas Brune, 
wrote:

> Hello!
>
> I'm having some troubles with using MappingCharFilterFactory in my schema.
> We're using it to replace some escaped html entities
> so HTMLStripCharFilterFactory can take care of those.
>
> When testing this out in Analysis it works perfectly, however, when adding
> elements to Solr, the mapping doesn't seem to apply.
>
> We're currently copying some other fields into the field with the replaces,
> so it's a MultiValued field. (Don't know if that makes a difference)
>
>  <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>      <charFilter class="solr.HTMLStripCharFilterFactory"/>
>      // other stuff
>    </analyzer>
>  </fieldType>
>
>  <field name="..." type="..." ... multiValued="true" required="false"
> termVectors="true" termPositions="true" termOffsets="true"/>
>
>
> Best Regards,
> *Lukas Brune* | Machine Learning Engineer & Web Developer | Comintelli AB
> lukas.br...@comintelli.com| Mobile:+46(0)706229823 |
> www.intelligence2day.com
>
> 
>


Re: advice on whether to use stopwords for use case

2020-09-30 Thread Alexandre Rafalovitch
You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  wrote:
>
> I am not sure why you think stop words are your first choice. Maybe I
> misunderstand the question. I read it as that you need to exclude
> completely a set of documents that include specific keywords when
> called from specific module.
>
> If I wanted to differentiate the searches from specific module, I
> would give that module a different end-point (Request Query Handler),
> instead of /select. So, /nocigs or whatever.
>
> Then, in that end-point, you could do all sorts of extra things, such
> as setting appends or even invariants parameters, which would include
> filter query to exclude any documents matching specific keywords. I
> assume it is ok to return documents that are matching for other
> reasons.
>
> Ideally, you would mark the cigs documents during indexing with a
> binary or enumeration flag and then during search you just need to
> check against that flag. In that case, you could copyField  your text
> and run it against something like
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
> combined with Shingles for multiwords. Or similar. And just transform
> it as index-only so that the result is basically a yes/no flag.
> Similar thing could be done with UpdateRequestProcessor pipeline if
> you want to end up with a true boolean flag. The idea is the same,
> just to have an index-only flag that you force lock into for any
> request from specific module.
>
> Or even with something like ElevationSearchComponent. Same idea.
>
> Hope this helps.
>
> Regards,
>Alex.
>
> On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
> >
> > Hi
> >
> > I have read in the mailings list that we should try to avoid using stop
> > words.
> >
> > I have a use case where I would like to know if there is other
> > alternative solutions beside using stop words.
> >
> > There is business requirement to return zero result when the search is
> > cigarette related words and the search is coming from a particular
> > module on our site. It does not apply to all searches from our site.
> > There is a list of these cigarette related words. This list contains
> > single word, multiple words (Electronic cigar), multiple words with
> > punctuation (e-cigarette case).
> > I am planning to copy a different set of search fields, that will
> > include the stopword filter in the index and query stage, for this
> > module to use.
> >
> > For this use case, other than using stop words to handle it, is there
> > any alternative solution?
> >
> > Derek
> >


Re: advice on whether to use stopwords for use case

2020-09-29 Thread Alexandre Rafalovitch
I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as that you need to exclude
completely a set of documents that include specific keywords when
called from specific module.

If I wanted to differentiate the searches from specific module, I
would give that module a different end-point (Request Query Handler),
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.
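
If you go the index-time flag route, the query side becomes trivial.
A minimal SolrJ sketch - the /nocigs endpoint is the hypothetical one
above, and cigs_flag is a made-up name for the flag field:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    QueryResponse searchNoCigs(SolrClient client, String userQuery) throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.setRequestHandler("/nocigs");      // hypothetical module-specific endpoint
        q.addFilterQuery("-cigs_flag:true"); // exclude docs flagged at index time
        return client.query(q);
    }

In practice you would bake that fq into the /nocigs handler as an
invariant, so the module cannot bypass it.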

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
>
> Hi
>
> I have read in the mailings list that we should try to avoid using stop
> words.
>
> I have a use case where I would like to know if there is other
> alternative solutions beside using stop words.
>
> There is business requirement to return zero result when the search is
> cigarette related words and the search is coming from a particular
> module on our site. It does not apply to all searches from our site.
> There is a list of these cigarette related words. This list contains
> single word, multiple words (Electronic cigar), multiple words with
> punctuation (e-cigarette case).
> I am planning to copy a different set of search fields, that will
> include the stopword filter in the index and query stage, for this
> module to use.
>
> For this use case, other than using stop words to handle it, is there
> any alternative solution?
>
> Derek
>


Re: Slow Solr 8 response for long query

2020-09-29 Thread Alexandre Rafalovitch
What does the debug output of the query show on the two Solr versions?

One thing that changed is sow (split on whitespace) parameter among
many. It is unlikely to be the cause, but I am mentioning just in
case.
https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters

Regards,
   Alex

On Tue, 29 Sep 2020 at 20:47, Permakoff, Vadim
 wrote:
>
> Hi Solr Experts!
> We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with long 
> query, which has a search text plus many OR and AND conditions (all in one 
> place, the query is about 20KB long).
> For the same set of data (about 500K docs) and the same schema the query in 
> Solr 6 return results in less than 2 sec, Solr 8 takes more than 10 sec to 
> get 10 results. If I increase the number of rows to 300, in Solr 6 it takes 
> about 10 sec, in Solr 8 it takes more than 1 min. The results are small, just 
> IDs. It looks like the relevancy scoring plays role, because if I move this 
> query to filter query - both Solr versions work pretty fast.
> The right way should be to change the query, but unfortunately it is 
> difficult to modify the application which creates these queries, so I want to 
> find some temporary workaround.
>
> What was changed from Solr 6 to Solr 8 in terms of scoring with many 
> conditions, which affects the search speed negatively?
> Is there anything to configure in Solr 8 to get the same performance for such 
> query like it was in Solr 6?
>
> Thank you,
> Vadim
>


Minimum set of jars to run EmbeddedSolrServer

2020-09-28 Thread Alexandre Rafalovitch
Hello,

Does anybody know (or has even experimented with) what the minimum
set of jars needed to run EmbeddedSolrServer is?

If I just include solr-core, that pulls in a huge number of Jars. I
don't need - for example - Lucene analyzers for Korean and Japanese
for this application.

But what else do I not need? Can I just leave out hadoop? calcite?
curator? Are these all loaded on demand or will something fail?
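
For concreteness, the kind of usage I mean is roughly this (solr home
path and core name are just placeholders):

    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Embedded Solr running in-process against a local solr home
    try (SolrClient client = new EmbeddedSolrServer(
            Paths.get("/path/to/solr-home"), "core1")) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        client.add(doc);
        client.commit();
    }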

Any pointers would be appreciated.

Regards,
Alex.


Re: Solr 8.6.2 UI issue

2020-09-25 Thread Alexandre Rafalovitch
Sounds strange. If you had Solr installed previously, it could be
cached Javascript. Force-reload or try doing it in an anonymous
window.

Also try starting with an example (bin/solr start -e techproducts).

Finally, if you are up to it, see if there are any serious errors in
the Browser's developer console's log.

If all else fails, try an earlier version of Solr, just to check
whether it could be something about the latest version (unlikely).

Regards,
   Alex

On Fri, 25 Sep 2020 at 18:43, Manisha Rahatadkar
 wrote:
>
> Hello All
>
> I downloaded 8.6.2 and running it on windows 10 machine. Solr starts on 8983 
> port but whenever I click on any menu like Logging, Core Admin, Query it 
> always shows only dashboard screen.
> Has anyone experienced this issue?
>
>
> Regards
> Manisha Rahatadkar
>


Re: Solr 8.6.2 text_general

2020-09-25 Thread Alexandre Rafalovitch
Ok, something is definitely not right. In those cases, I suggest
checking backwards from hard reality, just in case the file you are
looking at is NOT the one that is actually used when the collection is
set up. That has happened to me more times than I can count.

Point your Admin UI at the collection you are having issues with and check
the schema definitions there (either in the Files or even the Schema
screen). I still think your multiValued definition changed somewhere.

Regards,
  Alex.

On Fri, 25 Sep 2020 at 03:57, Anuj Bhargava  wrote:
>
> Schema on both are the same
>
>
>
>
>
>
> Regards,
>
> Anuj
>
> On Thu, 24 Sep 2020 at 18:58, Alexandre Rafalovitch 
> wrote:
>
> > These are field definitions for _text_ and text, your original
> > question was about the fields named "country"/"currency" and whatever
> > type they mapped to.
> >
> > Your text/_text_ field is not actually returned to the browser,
> > because it is "stored=false", so it is most likely a catch-all
> > copyField destination. You may be searching against it, but you are
> > returning other (original) fields.
> >
> > Regards,
> >Alex.
> >
> > On Thu, 24 Sep 2020 at 09:23, Anuj Bhargava  wrote:
> > >
> > > In both it is the same
> > >
> > > In Solr 8.0.0
> > > <field name="..." type="text_general" indexed="true"
> > > stored="false"/>
> > >
> > > In Solr 8.6.2
> > > <field name="..." type="text_general" ... multiValued="true"/>
> > >
> > > On Thu, 24 Sep 2020 at 18:33, Alexandre Rafalovitch 
> > > wrote:
> > >
> > > > I think that means your field went from multiValued to singleValued.
> > > > Double check your schema. Remember that multiValued flag can be set
> > > > both on the field itself and on its fieldType.
> > > >
> > > > Regards,
> > > >Alex
> > > > P.s. However if your field is supposed to be single-valued, maybe you
> > > > should treat it as a feature not a bug. Multivalued fields have some
> > > > restrictions that single-valued fields do not have (around sorting for
> > > > example).
> > > >
> > > > On Thu, 24 Sep 2020 at 03:09, Anuj Bhargava 
> > wrote:
> > > > >
> > > > > In solr 8.0.0 when running the query the data (type="text_general")
> > was
> > > > > shown in brackets *[ ]*
> > > > > "country":*[*"IN"*]*,
> > > > > "currency":*[*"INR"*]*,
> > > > > "date_c":"2020-08-23T18:30:00Z",
> > > > > "est_cost":0,
> > > > >
> > > > > However, in solr 8.6.2 the query the data (type="text_general") is
> > not
> > > > > showing in brackets [ ]
> > > > > "country":"IN",
> > > > > "currency":"INR",
> > > > > "date_c":"2020-08-23T18:30:00Z",
> > > > > "est_cost":0,
> > > > >
> > > > >
> > > > > How to get the query results to show brackets in Solr 8.6.2
> > > >
> >


Re: Index Deeply Nested documents and retrieve a full nested document in solr

2020-09-24 Thread Alexandre Rafalovitch
It is yes to both questions, but I am not sure if they play well
together for historical reasons.

For storing/parsing original JSON in any (custom) format:
https://lucene.apache.org/solr/guide/8_6/transforming-and-indexing-custom-json.html
(srcField parameter)
For indexing nested children (with named collections of subdocuments)
but in Solr's own JSON format:
https://lucene.apache.org/solr/guide/8_6/indexing-nested-documents.html

I am not sure if defining additional fields as per the second document
while indexing the first way will work together. Feedback on that
would be useful.

Please also note that Solr is not intended to be the primary storage
(like a database). If you do atomic operations, the stored JSON will
get out of sync as it is not regenerated. Also, for the advanced
searches, you may want to normalize your data in different ways than
those your original data structure has. So, you may want to consider
an architecture where that JSON is stored separately or is retrieved
from original database and the Solr is focused on good search and
returning you just the record ID. That would actually allow you to
store a lot less in Solr (like just IDs) and focus on indexing in the
best way. Not saying it is the right way for your needs, just that is
a non-obvious architecture choice you may want to keep in mind as you
add Solr to your existing stack.

Regards,
   Alex.

On Thu, 24 Sep 2020 at 10:23, Abhay Kumar  wrote:
>
> Hello Team,
>
> Can someone please help to index the below sample json document into Solr.
>
> I have following queries on indexing multi level child document.
>
>
>   1.  Can we specify names to documents hierarchy such as "therapeuticareas" 
> or "sites" while indexing.
>   2.  How can we index document at multi-level hierarchy.
>
> I have following queries on retrieving the result.
>
>
>   1.  How can I retrieve result with full nested structure.
>
> [{
>"id": "NCT0102",
>"title": "Congenital Adrenal Hyperplasia: Calcium Channels as 
> Therapeutic Targets",
>"phase": "Phase 1/Phase 2",
>"status": "Completed",
>"studytype": "Interventional",
>"enrollmenttype": "",
>"sponsorname": ["National Center for Research Resources 
> (NCRR)"],
>"sponsorrole": ["lead"],
>"score": [0],
>"source": "National Center for Research Resources (NCRR)",
>"therapeuticareas": [{
>  "taid": "ta1",
>  "ta": "Lung Cancer",
>  "diseaseAreas": ["Oncology, 
> Respiratory tract diseases"],
>  "pubmeds": [{
> "pmbid": "pm1",
> "articleTitle": 
> "Consensus minimum data set for lung cancer multidisciplinary teams Results 
> of a Delphi process",
> "revisedDate": 
> "2018-12-11T18:30:00Z"
>  }],
>  "conferences": [{
> "confid": "conf1",
> "conferencename": 
> "American Academy of Neurology Annual Meeting",
> 
> "conferencetopic": "Avances en el manejo de los trastornos del movimiento 
> hipercineticos",
> "conferencedate": 
> "2019-05-08T18:30:00Z"
>  }]
>   },
>   {
>  "taid": "ta2",
>  "ta": "Breast Cancer",
>  "diseaseAreas": ["Oncology"],
>  "pubmeds": [],
>  "conferences": []
>   }
>],
>
>"sites": [{
>   "siteid": "site1",
>   "type": "Hospital",
>   "institutionname": "Methodist Health System",
>   "country": "United States",
>   "state": "Texas",
>   "city": "Dallas",
>   "zip": ""
>}],
>
>"investigators": [{
>   "invid": "inv1",
>   "investigatorname": "Bryan A Faller",
>   "role": "Principal Investigator",
>   "location": "",
>  

Re: Solr 8.6.2 text_general

2020-09-24 Thread Alexandre Rafalovitch
These are field definitions for _text_ and text, your original
question was about the fields named "country"/"currency" and whatever
type they mapped to.

Your text/_text_ field is not actually returned to the browser,
because it is "stored=false", so it is most likely a catch-all
copyField destination. You may be searching against it, but you are
returning other (original) fields.

Regards,
   Alex.

On Thu, 24 Sep 2020 at 09:23, Anuj Bhargava  wrote:
>
> In both it is the same
>
> In Solr 8.0.0
> <field name="..." type="text_general" indexed="true" stored="false"/>
>
> In Solr 8.6.2
> <field name="..." type="text_general" ... multiValued="true"/>
>
> On Thu, 24 Sep 2020 at 18:33, Alexandre Rafalovitch 
> wrote:
>
> > I think that means your field went from multiValued to singleValued.
> > Double check your schema. Remember that multiValued flag can be set
> > both on the field itself and on its fieldType.
> >
> > Regards,
> >Alex
> > P.s. However if your field is supposed to be single-valued, maybe you
> > should treat it as a feature not a bug. Multivalued fields have some
> > restrictions that single-valued fields do not have (around sorting for
> > example).
> >
> > On Thu, 24 Sep 2020 at 03:09, Anuj Bhargava  wrote:
> > >
> > > In solr 8.0.0 when running the query the data (type="text_general") was
> > > shown in brackets *[ ]*
> > > "country":*[*"IN"*]*,
> > > "currency":*[*"INR"*]*,
> > > "date_c":"2020-08-23T18:30:00Z",
> > > "est_cost":0,
> > >
> > > However, in solr 8.6.2 the query the data (type="text_general") is not
> > > showing in brackets [ ]
> > > "country":"IN",
> > > "currency":"INR",
> > > "date_c":"2020-08-23T18:30:00Z",
> > > "est_cost":0,
> > >
> > >
> > > How to get the query results to show brackets in Solr 8.6.2
> >


Re: Solr 8.6.2 text_general

2020-09-24 Thread Alexandre Rafalovitch
I think that means your field went from multiValued to singleValued.
Double check your schema. Remember that multiValued flag can be set
both on the field itself and on its fieldType.

Regards,
   Alex
P.s. However if your field is supposed to be single-valued, maybe you
should treat it as a feature not a bug. Multivalued fields have some
restrictions that single-valued fields do not have (around sorting for
example).

On Thu, 24 Sep 2020 at 03:09, Anuj Bhargava  wrote:
>
> In solr 8.0.0 when running the query the data (type="text_general") was
> shown in brackets *[ ]*
> "country":*[*"IN"*]*,
> "currency":*[*"INR"*]*,
> "date_c":"2020-08-23T18:30:00Z",
> "est_cost":0,
>
> However, in solr 8.6.2 the query the data (type="text_general") is not
> showing in brackets [ ]
> "country":"IN",
> "currency":"INR",
> "date_c":"2020-08-23T18:30:00Z",
> "est_cost":0,
>
>
> How to get the query results to show brackets in Solr 8.6.2


Re: Pinging Solr

2020-09-18 Thread Alexandre Rafalovitch
Your builder parameter should be up to the collection, so only
"http://testserver-dtv:8984/solr/cpsearch;.
Then, on your Query object, you set
query.setRequestHandler("/select_cpsearch") as per
https://lucene.apache.org/solr/8_6_2/solr-solrj/org/apache/solr/client/solrj/SolrQuery.html#setRequestHandler-java.lang.String-
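
Putting the two together, roughly (same names as in your code):

    SolrClient solrClient = new HttpSolrClient.Builder(
            "http://testserver-dtv:8984/solr/cpsearch").build(); // stop at the collection
    SolrQuery query = new SolrQuery("*:*");
    query.setRequestHandler("/select_cpsearch"); // the handler goes here, not in the URL
    QueryResponse response = solrClient.query(query);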

I am not sure what is happening with your ping, but I also believe
that there is a definition by default in the latest Solr. You could
see all the definitions (including defaults), by using config API
(see. https://lucene.apache.org/solr/guide/8_6/config-api.html)

Regards,
   Alex.

On Fri, 18 Sep 2020 at 15:18, Steven White  wrote:
>
> Hi Erick,
>
> I'm on Solr 8.6.1.  I did further debugging into this and just noticed that
> my search is not working now either (this is after I changed the request
> handler name from "select" to "select_cpsearch").  I have this very basic
> test code which I think reveals the issue:
>
> try
> {
> SolrClient solrClient = new HttpSolrClient.Builder("
> http://testserver-dtv:8984/solr/cpsearch/select_cpsearch").build();
> SolrQuery query = new SolrQuery();
> query.set("q", "*");
> QueryResponse response = solrClient.query(query);
> }
> catch (Exception ex)
> {
> ex.printStackTrace();  // has this:
> "URI:/solr/cpsearch/select_cpsearch/select"
> }
>
> In the stack, there is this message (I'm showing the relevant part only):
>
> Error 404 Not Found
> 
> HTTP ERROR 404 Not Found
> 
> URI:/solr/cpsearch/select_cpsearch/select
>
> As you can see "select" got added to the URI.  I think this is the root
> cause for the ping issue too that I'm having, but even if it is not, I have
> to fix this search issue too but I don't know how to tell SolrJ to use my
> named search request handler.  Any ideas?
>
> Thanks.
>
> Steven
>
>
> On Fri, Sep 18, 2020 at 2:24 PM Erick Erickson 
> wrote:
>
> > This looks kind of confused. I’m assuming what you’re after is a way to get
> > to your select_cpsearch request handler to test if Solr is alive and
> > calling that
> > “ping”.
> >
> > The ping request handler is just that, a separate request handler that you
> > hit by going to
> > http://sever:port/solr/admin/ping.
> >
> > It has nothing to do at all with your custom search handler and in recent
> > versions of
> > Solr is implicitly defined so it should just be there.
> >
> > Your custom handler is defined as
> > <requestHandler name="/select_cpsearch" ...>
> >
> >


Re: NPE Issue with atomic update to nested document or child document through SolrJ

2020-09-17 Thread Alexandre Rafalovitch
The missing underscore is a documentation bug, because it was not
escaped the second time and the asciidoc chewed it up as a
bold/italic indicator. The declaration and references should match.

I am not sure about the code. I hope somebody else will step in on that part.

Regards,
   Alex.

On Thu, 17 Sep 2020 at 14:48, Pratik Patel  wrote:
>
> I am running this in a unit test which deletes the collection after the
> test is over. So every new test run gets a fresh collection.
>
> It is a very simple test where I am first indexing a couple of parent
> documents with few children and then testing an atomic update on one parent
> as I have posted in my previous message. (using UpdateRequest)
>
> I am not sure if I am triggering the atomic update correctly, do you see
> any potential issue in that code?
>
> I noticed something in the documentation here.
> https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html#indexing-nested-documents
>
> <fieldType name="_nest_path_" class="solr.NestPathField" />
> <field name="_nest_path_" type="nest_path" />
>
> The fieldType is declared with name "_nest_path_" whereas the field is declared
> with type "nest_path".
>
> Is this intentional? or should it be as follows?
>
> <field name="_nest_path_" type="_nest_path_" />
>
> Also, should we explicitly set index=true and store=true on _nest_path_
> and _nest_parent_ fields?
>
>
>
> On Thu, Sep 17, 2020 at 1:17 PM Alexandre Rafalovitch 
> wrote:
>
> > Did you reindex the original document after you added a new field? If
> > not, then the previously indexed content is missing it and your code
> > paths will get out of sync.
> >
> > Regards,
> >Alex.
> > P.s. I haven't done what you are doing before, so there may be
> > something I am missing myself.
> >
> >
> > On Thu, 17 Sep 2020 at 12:46, Pratik Patel  wrote:
> > >
> > > Thanks for your reply Alexandre.
> > >
> > > I have "_root_" and "_nest_path_" fields in my schema but not
> > > "_nest_parent_".
> > >
> > >
> > > <field name="_root_" type="string" indexed="true" stored="false"
> > > docValues="false" />
> > > <fieldType name="_nest_path_" class="solr.NestPathField" />
> > > <field name="_nest_path_" type="_nest_path_" />
> > >
> > > I ran my test after adding the "_nest_parent_" field and I am not getting
> > > NPE any more which is good. Thanks!
> > >
> > > But looking at the documents in the index, I see that after the atomic
> > > update, now there are two children documents with the same id. One
> > document
> > > has old values and another one has new values. Shouldn't they be merged
> > > based on the "id"? Do we need to specify anything else in the request to
> > > ensure that documents are merged/updated and not duplicated?
> > >
> > > For your reference, below is the test I am running now.
> > >
> > > // update field of one child doc
> > > SolrInputDocument sdoc = new SolrInputDocument(  );
> > > sdoc.addField( "id", testChildPOJO.id() );
> > > sdoc.addField( "conceptid", testChildPOJO.conceptid() );
> > > sdoc.addField( "storeid", "foo" );
> > > sdoc.setField( "fieldName",
> > > java.util.Collections.singletonMap("set", Collections.list("bar" ) ));
> > >
> > > final UpdateRequest req = new UpdateRequest();
> > > req.withRoute( pojo1.id() );// parent id
> > > req.add(sdoc);
> > >
> > > collection.client.request( req,
> > collection.getCollectionName()
> > > );
> > > collection.client.commit();
> > >
> > >
> > > Resulting documents :
> > >
> > > {id=c1_child1, conceptid=c1, storeid=s1,
> > fieldName=c1_child1_field_value1,
> > > startTime=Mon Sep 07 12:40:37 EDT 2020, integerField_iDF=10,
> > > booleanField_bDF=true, _root_=abcd, _version_=1678099970090074112}
> > > {id=c1_child1, conceptid=c1, storeid=foo, fieldName=bar, startTime=Mon
> > Sep
> > > 07 12:40:37 EDT 2020, integerField_iDF=10, booleanField_bDF=true,
> > > _root_=abcd, _version_=1678099970405695488}
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Sep 17, 2020 at 12:01 PM Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > > wrote:
> > >
> > > > Can you double

Re: How to remove duplicate tokens from solr

2020-09-17 Thread Alexandre Rafalovitch
This is not quite enough information.
There is 
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
but it has specific limitations.

What is the problem that you are trying to solve that you feel is due
to duplicate tokens? Why are they duplicates? Is it about storage or
relevancy?

Regards,
   Alex.

On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo  wrote:
>
>  Is there any way to remove duplicate tokens from Solr? Is there any filter
> for this?
> for this.


Re: Doing what <copyField> does using SolrJ API

2020-09-17 Thread Alexandre Rafalovitch
Solr has a whole pipeline that you can run during document ingestion, before
the actual indexing happens. It is called the Update Request Processor (URP)
chain and is defined in solrconfig.xml or in an override file. Obviously, since
you are indexing from a SolrJ client, you have even more flexibility, but it
is good to know about anyway.

You can read all about it at:
https://lucene.apache.org/solr/guide/8_6/update-request-processors.html and
see the extensive list of processors you can leverage. The specific
mentioned one is this one:
https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Just a word of warning that the Stateless URP uses JavaScript, which is
becoming a bit of a complicated story as the underlying JVM is upgraded (Oracle
dropped their JavaScript engine in JDK 14). So if one of the simpler URPs,
or a chain of them, will do the job, that may be a better path to take.
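
As an aside, since the documents are being assembled in SolrJ anyway,
the combine-fields idea can also be done entirely client-side, without
any URP. A rough sketch - the catch_all field name and the dbRow map
are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.solr.common.SolrInputDocument;

    SolrInputDocument toSolrDoc(String id, Map<String, String> dbRow) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        List<String> catchAll = new ArrayList<>();
        for (Map.Entry<String, String> col : dbRow.entrySet()) {
            doc.addField(col.getKey(), col.getValue());
            catchAll.add(col.getValue());
        }
        doc.addField("catch_all", catchAll); // multiValued catch-all field
        return doc;
    }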

Regards,
   Alex.


On Thu, 17 Sep 2020 at 13:13, Steven White  wrote:

> Thanks Erick.  Where can I learn more about "stateless script update
> processor factory"?  I don't know what you mean by this.
>
> Steven
>
> On Thu, Sep 17, 2020 at 1:08 PM Erick Erickson 
> wrote:
>
> > 1000 fields is fine, you'll waste some cycles on bookkeeping, but I
> really
> > doubt you'll notice. That said, are these fields used for searching?
> > Because you do have control over what goes into the index if you can put
> a
> > "stateless script update processor factory" in your update chain. There
> you
> > can do whatever you want, including combine all the fields into one and
> > delete the original fields. There's no point in having your index
> cluttered
> > with unused fields, OTOH, it may not be worth the effort just to satisfy
> my
> > sense of aesthetics 
> >
> > On Thu, Sep 17, 2020, 12:59 Steven White  wrote:
> >
> > > Hi Eric,
> > >
> > > Yes, this is coming from a DB.  Unfortunately I have no control over
> the
> > > list of fields.  Out of the 1000 fields that there may be, no document
> > that
> > > gets indexed into Solr will use more than about 50 and since I'm
> copying
> > > the values of those fields to the catch-all field and the catch-all
> field
> > > is my default search field, I don't expect any problem for having 1000
> > > fields in Solr's schema, or should I?
> > >
> > > Thanks
> > >
> > > Steven
> > >
> > >
> > > On Thu, Sep 17, 2020 at 8:23 AM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > > > “there over 1000 of them[fields]”
> > > >
> > > > This is often a red flag in my experience. Solr will handle that many
> > > > fields, I’ve seen many more. But this is often a result of
> > > > “database thinking”, i.e. your mental model of how all this data
> > > > is from a DB perspective rather than a search perspective.
> > > >
> > > > It’s unwieldy to have that many fields. Obviously I don’t know the
> > > > particulars of
> > > > your app, and maybe that’s the best design. Particularly if many of
> the
> > > > fields
> > > > are sparsely populated, i.e. only a small percentage of the documents
> > in
> > > > your
> > > > corpus have any value for that field then taking a step back and
> > looking
> > > > at the design might save you some grief down the line.
> > > >
> > > > For instance, I’ve seen designs where instead of
> > > > field1:some_value
> > > > field2:other_value….
> > > >
> > > > you use a single field with _tokens_ like:
> > > > field:field1_some_value
> > > > field:field2_other_value
> > > >
> > > > that drops the complexity and increases performance.
> > > >
> > > > Anyway, just a thought you might want to consider.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Sep 16, 2020, at 9:31 PM, Steven White 
> > > wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > I figured it out.  It is as simple as creating a List and
> > using
> > > > > that as the value part for SolrInputDocument.addField() API.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Steven
> > > > >
> > > > >
> > > > > On Wed, Sep 16, 2020 at 9:13 PM Steven White  >
> > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I want to avoid creating a <copyField dest="..." source="OneFieldOfMany"/> in my schema (there will be over 1000 of
> > > them
> > > > and
> > > > >> maybe more so managing it will be a pain).  Instead, I want to use
> > > SolrJ
> > > > >> API to do what <copyField> does.  Any example of how I can do
> this?
> > > If
> > > > >> there is an example online, that would be great.
> > > > >>
> > > > >> Thanks in advance.
> > > > >>
> > > > >> Steven
> > > > >>
> > > >
> > > >
> > >
> >
>


Re: NPE Issue with atomic update to nested document or child document through SolrJ

2020-09-17 Thread Alexandre Rafalovitch
Did you reindex the original document after you added a new field? If
not, then the previously indexed content is missing it and your code
paths will get out of sync.

Regards,
   Alex.
P.s. I haven't done what you are doing before, so there may be
something I am missing myself.


On Thu, 17 Sep 2020 at 12:46, Pratik Patel  wrote:
>
> Thanks for your reply Alexandre.
>
> I have "_root_" and "_nest_path_" fields in my schema but not
> "_nest_parent_".
>
>
> <field name="_root_" type="string" indexed="true" stored="false"
> docValues="false" />
> <fieldType name="_nest_path_" class="solr.NestPathField" />
> <field name="_nest_path_" type="_nest_path_" />
>
> I ran my test after adding the "_nest_parent_" field and I am not getting
> NPE any more which is good. Thanks!
>
> But looking at the documents in the index, I see that after the atomic
> update, now there are two children documents with the same id. One document
> has old values and another one has new values. Shouldn't they be merged
> based on the "id"? Do we need to specify anything else in the request to
> ensure that documents are merged/updated and not duplicated?
>
> For your reference, below is the test I am running now.
>
> // update field of one child doc
> SolrInputDocument sdoc = new SolrInputDocument(  );
> sdoc.addField( "id", testChildPOJO.id() );
> sdoc.addField( "conceptid", testChildPOJO.conceptid() );
> sdoc.addField( "storeid", "foo" );
> sdoc.setField( "fieldName",
> java.util.Collections.singletonMap("set", Collections.list("bar" ) ));
>
> final UpdateRequest req = new UpdateRequest();
> req.withRoute( pojo1.id() );// parent id
> req.add(sdoc);
>
> collection.client.request( req, collection.getCollectionName()
> );
> collection.client.commit();
>
>
> Resulting documents :
>
> {id=c1_child1, conceptid=c1, storeid=s1, fieldName=c1_child1_field_value1,
> startTime=Mon Sep 07 12:40:37 EDT 2020, integerField_iDF=10,
> booleanField_bDF=true, _root_=abcd, _version_=1678099970090074112}
> {id=c1_child1, conceptid=c1, storeid=foo, fieldName=bar, startTime=Mon Sep
> 07 12:40:37 EDT 2020, integerField_iDF=10, booleanField_bDF=true,
> _root_=abcd, _version_=1678099970405695488}
>
>
>
>
>
>
> On Thu, Sep 17, 2020 at 12:01 PM Alexandre Rafalovitch 
> wrote:
>
> > Can you double-check your schema to see if you have all the fields
> > required to support nested documents? You are supposed to get away
> > with just _root_, but really you should also include _nest_path_ and
> > _nest_parent_. Your particular exception seems to be triggering
> > something (maybe a bug) related to - possibly - missing _nest_path_
> > field.
> >
> > See:
> > https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html#indexing-nested-documents
> >
> > Regards,
> >Alex.
> >
> > On Wed, 16 Sep 2020 at 13:28, Pratik Patel  wrote:
> > >
> > > Hello Everyone,
> > >
> > > I am trying to update a field of a child document using atomic updates
> > > feature. I am using solr and solrJ version 8.5.0
> > >
> > > I have ensured that my schema satisfies the conditions for atomic updates
> > > and I am able to do atomic updates on normal documents but with nested
> > > child documents, I am getting a Null Pointer Exception. Following is the
> > > simple test which I am trying.
> > >
> > > TestPojo  pojo1  = new TestPojo().cId( "abcd" )
> > > >  .conceptid( "c1" )
> > > >  .storeid( storeId )
> > > >  .testChildPojos(
> > > > Collections.list( testChildPOJO, testChildPOJO2,
> > > >
> > testChildPOJO3 )
> > > > );
> > > > TestChildPOJOtestChildPOJO = new TestChildPOJO().cId(
> > > > "c1_child1" )
> > > >   .conceptid( "c1"
> > )
> > > >   .storeid(
> > storeId )
> > > >   .fieldName(
> > > > "c1_child1_field_value1" )
> > > >   .startTime(
> > > > Date.from( now.minus( 10, ChronoUnit.DAYS ) ) )
> > > >
> >  .inte

Re: NPE Issue with atomic update to nested document or child document through SolrJ

2020-09-17 Thread Alexandre Rafalovitch
Can you double-check your schema to see if you have all the fields
required to support nested documents? You are supposed to get away
with just _root_, but really you should also include _nest_path_ and
_nest_parent_. Your particular exception seems to be triggering
something (maybe a bug) related to - possibly - a missing _nest_path_
field.

See: 
https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html#indexing-nested-documents

Regards,
   Alex.

On Wed, 16 Sep 2020 at 13:28, Pratik Patel  wrote:
>
> Hello Everyone,
>
> I am trying to update a field of a child document using atomic updates
> feature. I am using solr and solrJ version 8.5.0
>
> I have ensured that my schema satisfies the conditions for atomic updates
> and I am able to do atomic updates on normal documents but with nested
> child documents, I am getting a Null Pointer Exception. Following is the
> simple test which I am trying.
>
> TestPojo  pojo1  = new TestPojo().cId( "abcd" )
> >  .conceptid( "c1" )
> >  .storeid( storeId )
> >  .testChildPojos(
> > Collections.list( testChildPOJO, testChildPOJO2,
> >  testChildPOJO3 )
> > );
> > TestChildPOJOtestChildPOJO = new TestChildPOJO().cId(
> > "c1_child1" )
> >   .conceptid( "c1" )
> >   .storeid( storeId )
> >   .fieldName(
> > "c1_child1_field_value1" )
> >   .startTime(
> > Date.from( now.minus( 10, ChronoUnit.DAYS ) ) )
> >   .integerField_iDF(
> > 10 )
> >
> > .booleanField_bDF(true);
> > // index pojo1 with child testChildPOJO
> > SolrInputDocument sdoc = new SolrInputDocument();
> > sdoc.addField( "_route_", pojo1.cId() );
> > sdoc.addField( "id", testChildPOJO.cId() );
> > sdoc.addField( "conceptid", testChildPOJO.conceptid() );
> > sdoc.addField( "storeid", testChildPOJO.cId() );
> > sdoc.setField( "fieldName", java.util.Collections.singletonMap("set",
> > Collections.list(testChildPOJO.fieldName() + postfix) ) ); // modify field
> > "fieldName"
> > collection.client.add( sdoc );   // results in NPE!
>
>
> Stack Trace:
>
> ERROR org.apache.solr.client.solrj.impl.BaseCloudSolrClient - Request to
> > collection [collectionTest2] failed due to (500)
> > org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> > from server at
> > http://172.15.1.100:8081/solr/collectionTest2_shard1_replica_n1:
> > java.lang.NullPointerException
> > at
> > org.apache.solr.update.processor.AtomicUpdateDocumentMerger.getFieldFromHierarchy(AtomicUpdateDocumentMerger.java:308)
> > at
> > org.apache.solr.update.processor.AtomicUpdateDocumentMerger.mergeChildDoc(AtomicUpdateDocumentMerger.java:405)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(DistributedUpdateProcessor.java:711)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:374)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
> > at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
> > at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
> > at
> > org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
> > at
> > org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> > at
> > org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
> > at
> > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:332)
> > at
> > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:281)
> > at
> > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
> > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
> > at
> > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:236)
> > at
> > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
> > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
> > at
> > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196)
> > at
> > 

Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Alexandre Rafalovitch
There are a lot of different use cases and the separate analyzers for
indexing and query is part of the Solr power. For example, you could
apply ngrams during indexing time to generate multiple substrings. But
you don't want to do that during the query, because otherwise you are
matching on a 'shared prefix' instead of on what the user entered. Think of
a phone number directory where people may enter any suffix and you want
to match it.
See for example
https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
, starting slide 16 onwards.

Or, for non-production but fun use case:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
 (search phonetically mapped Thai text in English).

Similarly, you may want to apply synonyms at query time only if you
want to avoid diluting some relevancy. Or at index type to normalize
spelling and help relevancy.

Or you may want to be doing some accent folding for sorting or
faceting (which uses indexed tokens).

Regards,
   Alex.

On Thu, 10 Sep 2020 at 11:19, Steven White  wrote:
>
> Hi everyone,
>
> In Solr's schema, I have come across field types that use a different logic
> for "index" than for "query".  To be clear, I"m talking about this block:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     ...
>   </analyzer>
>   <analyzer type="query">
>     ...
>   </analyzer>
> </fieldType>
>
> Why would one not want to use the same logic for both and simply use:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     ...
>   </analyzer>
> </fieldType>
>
> What are real word use cases to use a different analyzer for index and
> query?
>
> Thanks,
>
> Steve


Re: Inverse English an digits in Arabic Text

2020-09-08 Thread Alexandre Rafalovitch
If you are uploading a PDF, then you must be doing it via Tika or via
an extract handler (which uses Tika under the covers).

Try getting a standalone Tika of the same version and see what it
outputs. Perhaps there is something in those specific PDF pages that
confuses Tika. For example, if the document used a different font for
English text, Adobe may have encoded each letter individually and thereby
broken the flow. PDF is not a content format but a presentation format. These
things happen.

Regards,
   Alex

On Tue, 8 Sep 2020 at 09:11,  wrote:
>
>
> Thank you for the support,
>
> I upload the PDF file page by page. And in this case left-to-right (LTR) or
> right-to-left (RTL) reading applies to the whole document, not to a specific
> text block (separate for Arabic, separate for English).
>
> I can see the same behavior with the output via the /select as well as the
> /browse call.
>
> Almost sure the problem happens during upload.
>
> But adding this to the analyzer,
> and later to another analyzer, does not change the
> result.
>
>


Re: Inverse English an digits in Arabic Text

2020-09-07 Thread Alexandre Rafalovitch
> Doc in Arabic with some English - English text is inverted (for example,
"gro.echapa.www"), what makes search by key words impossible.

What very specifically do you mean by that. How do you see the inversion?

If that's within some sort of web ui, then you are probably seeing the HTML
bidi (bidirectional LTR/RTL) presentation issues.

And if you are seeing in in Cloudera UI, then the question may be for their
forum.

One way to test is to have English text in brackets "(www.apache.org)"
within Arabic flow. If you see again your issue but the brackets get weird
"((gro.", this is most likely a bidi presentation issue with algorithm
or HTML attribute set to RTL.

Could be something else though, but that would be a start point.

Regards,
Alex


On Mon., Sep. 7, 2020, 5:54 a.m. ,  wrote:

> Hi,
>
> Could you please help to resolve an issue? I upload/index several documents in
> English and in Arabic to Solr, and in addition I use a handler for the
> Arabic language:
>   <fieldType name="..." class="solr.TextField">
>     <analyzer type="index">
>       <tokenizer .../>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.ArabicNormalizationFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer .../>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.SynonymFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.ArabicNormalizationFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> There are two environments:
> Local machine: - Solr version: 4.2
> - Windows version: 10
>
> DEV env: - Solr version 4.1 as part of the Cloudera suite
> - Linux kernel version: 3.10.0-862
>
> Issue appears when uploading documents:
> Local machine: - Doc in English with English words only -
> ok (for example, "www.apache.org")
> - Doc in Arabic with some English words - ok (for example,
> "www.apache.org")
>
> DEV env: - Doc in English with English words only - ok
> (for example, "www.apache.org")
> - Doc in Arabic with some English - English text is
> inverted (for example, "gro.echapa.www"), what makes search by key words
> impossible.
>
> Please advise whether this fixable and how?
>
> Thank you in advance!
>


Re: Must specify either 'defaultFieldType' or declare one typeMapping as default

2020-09-05 Thread Alexandre Rafalovitch
That's a really hard way to get introduced to Solr. What about
downloading Solr and running one of the built-in examples? Because you
are figuring out so many variables at once.

Either way, your specific issue is not in schema.xml (which should be
converted to managed-schema on first run, btw, don't have both in the
same directory).

Your issue is in the solrconfig.xml and it is complaining about
schemaless mode missing the default type mapping for when no other
rules match. As explained at:
https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/update/processor/AddSchemaFieldsUpdateProcessorFactory.html

In your linked example, that's the line:
<str name="defaultFieldType">strings</str>

In the latest Solr, it uses the default flag instead, so maybe you
ended up having neither. I think the example basically skips all
the mapping and creates all fields automatically of type strings. So,
you can probably remove all processors apart from add-schema-fields in the
updateRequestProcessorChain definition and reintroduce the
defaultFieldType from the original example.
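
As a rough sketch (the element names follow the 8.x _default configset;
treat the exact surrounding processors as version-dependent), the relevant
bit in solrconfig.xml would be:

<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory"
                 name="add-schema-fields">
  <str name="defaultFieldType">strings</str>
</updateProcessor>

With defaultFieldType set, any value that matches no typeMapping falls back
to the strings type.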

Hope that gets you where you want. If it does not, don't give up on
Solr, just step back a bit and get it working by itself first. Maybe
try following my presentation from a while ago:
https://www.slideshare.net/arafalov/from-content-to-search-speeddating-apache-solr-apachecon-2018-116330553
, the associated Github repo is at:
https://github.com/arafalov/solr-apachecon2018-presentation

Regards,
   Alex.


On Sat, 5 Sep 2020 at 20:32, Ronald Roeleveld
 wrote:
>
> Hi there,
> I'm a totally newb when it comes to Solr, I'm just trying to learn and
> teach myself new skills so please bear with me.
>
> I'm following a tutorial online:
> https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
>
> Since this tutorial is somewhat outdated I've changed some things to
> use current versions which is why I'm now running into issues.
> In order to solve any issues I've searched online and mostly solved
> everything with bits and pieces here and there. This part I can't seem
> to solve though.
>
> I've added a schema.xml and changed the solrconfig.xml as described in
> the tutorial and while Solr does start it gives me the following error
> message:
>
> mycol1: 
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> Must specify either 'defaultFieldType' or declare one typeMapping as
> default.
>
> I've had errors before related to outdated lines in schema.xml but I
> was able to fix those (I think).
>
> This is my current schema.xml
>
> -
> <schema name="..." version="...">
>   <field name="..." type="..." ... required="true" multiValued="false"/>
>   <field name="..." type="..." ... multiValued="true"/>
>   ...
>   <field name="..." type="..." ... docValues="false"/>
>   <field name="..." type="..." ... stored="false" multiValued="true"/>
>   <uniqueKey>id</uniqueKey>
>   <field name="..." type="..." ... docValues="true"/>
>
>   <fieldType name="..." class="solr.TextField" positionIncrementGap="100" multiValued="true">
>     <analyzer type="index">
>       <tokenizer class="..."/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>       <filter class="..."/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="..."/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>       <filter class="..." synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     </analyzer>
>   </fieldType>
>
>   <fieldType name="..." class="..." sortMissingLast="true" multiValued="true"/>
>   <fieldType name="..." class="..." docValues="true" multiValued="true"/>
>   <fieldType name="..." class="..." docValues="true" multiValued="true"/>
>   <fieldType name="..." class="..." docValues="true" multiValued="true"/>
> </schema>
>
> --
>
> I hope someone is able to get me started with this. I'd like to learn
> but it feels like a really steep hill to climb.
>
> Kind regards,
> Ronald.


Re: Can't get Solr to work with Dovecot

2020-08-27 Thread Alexandre Rafalovitch
Is this a Solr-side message? Looks like dovecot doing proactive
trimming of some crazy long header.

You can lookup the record by UID in the Admin UI (UID=153535 instead
of *:*) to check what is being indexed. Check that dovecot does not do
any prefixing of field names (any record from first generic query will
show that)

Most likely (if it is a header) you did not want to search on it
anyway. And if the field type it indexes too is a string (therefore
not tokenized), you probably will not be able to search on it
meaningfully anyway.

Regards,
   Alex.

On Thu, 27 Aug 2020 at 16:14, Francis Augusto Medeiros-Logeay
 wrote:
>
> It works now! You were right - the files were on a different place. It
> seems to be working now.
>
> One last question:
>
> I got this error:
>
> 400/49727
> doveadm(fran...@francisaugusto.com): Warning:
> fts-solr(fran...@francisaugusto.com): Mailbox All Mail UID=153535 header
> size is huge, truncating
> 22400/49727
>
> How can I index this?
>
> Best,
>
> Francis
>
>
> On 2020-08-27 18:58, Francis Augusto Medeiros-Logeay wrote:
> > Hi,
> >
> > I have - for a long time now - hoped to use an fts engine with
> > dovecot. My dovecot version is 2.3.7.2 under Ubuntu 20.04.
> >
> > I installed solr 7.7.3 and then 8.6.0 to see if this was a
> > version-related error. I copied the schema from 7.7.0 as many people
> > said this was fine.
> >
> > I get the following error when trying to reindex a user's mailbox:
> >
> > doveadm(fran...@francisaugusto.com): Error: fts_solr: Indexing failed:
> > 400 Bad Request
> > doveadm(fran...@francisaugusto.com): Error: Mailbox INBOX: Transaction
> > commit failed: FTS transaction commit failed: backend deinit
> > doveadm(fran...@francisaugusto.com): Debug: auth-master: conn
> > unix:/var/run/dovecot/auth-userdb: Disconnected: Connection closed
> > (fd=10)
> >
> > On Solr I get this error:
> >
> > org.apache.solr.common.SolrException: Exception writing document id
> > 210/9fd7941e8297d25d9160c3fdd3da/fran...@francisaugusto.com to the
> > index; possible analysis error: cannot change field "box" from index
> > options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index
> > options=DOCS
> > at
> > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:246)
> >
> > Parallel to this, I got some log messages on Solr before attempting to
> > reindex the user (sorry for the garbadged text:
> >
> > Time (Local)  Level   CoreLogger  Message
> > 7/21/2020, 6:38:47 PM WARN false  x:dovecot   SolrResourceLoader
> >   Solr
> > loaded a deprecated plugin/analysis class [solr.TrieLongField]. Please
> > consult documentation how to replace it accordingly.
> > 7/21/2020, 6:38:47 PM WARN false  x:dovecot   SolrResourceLoader
> >   Solr
> > loaded a deprecated plugin/analysis class [solr.SynonymFilterFactory].
> > Please consult documentation how to replace it accordingly.
> > 7/21/2020, 6:38:47 PM WARN false  x:dovecot   SolrResourceLoader
> >   Solr
> > loaded a deprecated plugin/analysis class
> > [solr.WordDelimiterFilterFactory]. Please consult documentation how to
> > replace it accordingly.
> > 7/22/2020, 6:43:46 PM ERROR
> > false x:dovecot   RequestHandlerBase  
> > java.lang.IllegalStateException:
> > Type mismatch: uid was indexed as SORTED_NUMERIC
> > 7/22/2020, 6:43:46 PM ERROR
> > false x:dovecot   HttpSolrCallnull:java.lang.IllegalStateException:
> > Type mismatch: uid was indexed as SORTED_NUMERIC
> > 7/22/2020, 6:43:49 PM ERROR
> > false x:dovecot   RequestHandlerBase  
> > java.lang.IllegalStateException:
> > Type mismatch: uid was indexed as SORTED_NUMERIC
> > 7/22/2020, 6:43:49 PM ERROR
> > false x:dovecot   HttpSolrCallnull:java.lang.IllegalStateException:
> > Type mismatch: uid was indexed as SORTED_NUMERIC
> > 7/22/2020, 6:43:56 PM ERROR
> > false x:dovecot   RequestHandlerBase  
> > java.lang.IllegalStateException:
> > Type mismatch: uid was indexed as SORTED_NUMERIC
> > 7/22/2020, 6:43:56 PM ERROR
> > false x:dovecot   HttpSolrCallnull:java.lang.IllegalSta
> >
> > I tried again. I get no errors when doing a `doveadm fts rescan`,
> > but get errors when trying this:
> >
> > doveadm index -u myu...@mydomain.com INBOX
> > doveadm(myu...@mydomain.com): Error: fts_solr: Indexing failed: 400 Bad
> > Request
> > doveadm(myu...@mydomain.com): Error: Mailbox INBOX: Transaction commit
> > failed: FTS transaction commit failed: backend deinit
> >
> > I guess this is a matter of waiting the reindex to be over?
> >
> > I get so many "Type mismatch" errors in Solr, except for this one that
> > looks different and showed up after trying the doveadm index command
> > above:
> >
> >
> > ERROR true
> > x:dovecot
> > RequestHandlerBase
> > org.apache.solr.common.SolrException: Exception writing document id
> > 210/9fd7941e8297d25d9160c3fdd3da/myu...@mydomain.com to the index;
> > possible analysis error: cannot change field "box" from index
> > 

Re: Exclude a folder/directory from indexing

2020-08-27 Thread Alexandre Rafalovitch
If you are indexing from Drupal into Solr, that's the question for
Drupal's solr module. If you are doing it some other way, which way
are you doing it? bin/post command?

Most likely this is not the Solr question, but whatever you have
feeding data into Solr.

Regards,
  Alex.

On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
 wrote:
>
> Can you or how do you exclude a specific folder/directory from indexing in 
> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>
> Thanks,
>
> Phil Staley
> DCF Webmaster
> 608 422-6569
> phil.sta...@wisconsin.gov
>
>


Re: Can't get Solr to work with Dovecot

2020-08-27 Thread Alexandre Rafalovitch
Ok, you may want to step back and do a basic Solr example (download the
matching version tgz file, decompress it; "bin/solr -e techproducts" is a
good one; you may need to shut down the other Solr or give it a different
port with the -p flag), just so you know what you are looking at before
dovecot starts to introduce extra complexities. These are two full-featured
products we are trying to mesh.

But specifically regarding the URL, in http://localhost:8983/solr/dovecot/select
* http://localhost:8983 - server
* /solr - is a sort of compulsory part of URL
* /dovecot - is the name of the core/directory . You usually give URL
up to that point in configuration
* /select - is the request handler. Dovecot knows to add it (unless
you have a custom schema, in which case a different one may be defined in
solrconfig.xml). But if you want to run your own query, you need the
full URL and then some params (?q=*:* is a good start; see the example below)
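
For instance, a minimal sanity check from the command line (default port
assumed):

curl 'http://localhost:8983/solr/dovecot/select?q=*:*&rows=1'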

Or, you can hit http://localhost:8983 in your browser and it will open
the Admin UI and show you the cores and you could run the Query from
there. Note that Admin UI for the core (with # in it) is not the URL
you should be using directly. In the query screen, it does show you
the full query string though.

While in Admin UI, have a look at core information and it will tell
you where the solr.home is (/var/solr/data?), where the conf directory
and data directories are (usually right under solr.home). If these are
not where you expect them, this will help to debug.

Regards,
   Alex.

On Thu, 27 Aug 2020 at 14:58, Francis Augusto Medeiros-Logeay
 wrote:
>
> Hi Alex and Erick,
>
> Thanks for helping out.
>
> True, restarting solr recreated the directory, but I still get 500
> internal errors when reindexing from Dovecot.
> Just to be clear: I delete the Data directory inside the
> solr/data/dovecot directory.
>
> All the directories are owned by solr:solr, so it doesn't look like it's
> a permission issue.
>
> I did a curl to see if I get something, but got this:
> $curl localhost:8983/solr/dovecot
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
> <title>Error 404 Not Found</title>
> </head>
> <body><h2>HTTP ERROR 404</h2>
> <p>Problem accessing /solr/dovecot. Reason:
> <pre>    Not Found</pre></p>
> </body>
> </html>
>
> Anything else I could try?
>
> Best,
>
> Francis
>
> On 2020-08-27 20:46, Alexandre Rafalovitch wrote:
> > Uhm right. I may have forgotten to mention that you do need to reload
> > the core or maybe restart Solr server as well. If you literally just
> > deleted the index, Solr is probably freaking out about suddenly gone
> > files. It needs to redo the path of "is this the first time or do I
> > reopen the indexes"
> >
> > Regards,
> >Alex.
> >
> > On Thu, 27 Aug 2020 at 14:29, Erick Erickson 
> > wrote:
> >>
> >> “write.lock” is used by Lucene to insure that no two cores open the
> >> same index because if they do, Bad Things Happen.
> >>
> >> The “NoSuchFileException” may be a bit misleading, is there any chance
> >> that any other core is looking at the same directory?
> >>
> >> And your assertion: I then deleted the "data" under the
> >> "/var/solr/data/dovecot/data", and it didn't get recreated.
> >> makes me _strongly_ suspect that your Solr data directories aren’t
> >> where you think they are.
> >>
> >> Hmmm, you start with: sudo -u solr /opt/solr/bin/solr create -c
> >> dovecot
> >>
> >> which makes me wonder if this is a permissions issue. What are the
> >> perms on that directory? Is the user that
> >> starts Solr able to write to them?
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> > On Aug 27, 2020, at 2:13 PM, Francis Augusto Medeiros-Logeay 
> >> >  wrote:
> >> >
> >> > Thanks Alex.
> >> >
> >> > Well, I just deleted the whole data, and configured it again, and get 
> >> > these errors from dovecot when indexing:
> >> >
> >> > doveadm(fran...@francisaugusto.com): Error: fts_solr: Indexing failed: 
> >> > 500 Server Error
> >> > doveadm(fran...@francisaugusto.com): Error: Mailbox UOL: Mail search 
> >> > failed: Internal error occurred. Refer to server log for more 
> >> > information. [2020-08-27 20:09:36]
> >> > doveadm(fran...@francisaugusto.com): Error: fts_solr: Indexing failed: 
> >> > 500 Server Error
> >> > doveadm(fran...@francisaugusto.com): Error: Mailbox UOL: Transaction 
> >> > commit failed: FTS transaction commit failed: backend deinit
> >> > doveadm(fran...@francisaugusto.com): Error: Mailbox All Mail: Mail 
> >> > search failed: Internal error 

Re: Can't get Solr to work with Dovecot

2020-08-27 Thread Alexandre Rafalovitch
> >   at
> > org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293)
> >   at 
> > org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:282)
> >   at 
> > org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:260)
> >   at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2073)
> >   ... 51 more
> > Caused by: java.nio.file.NoSuchFileException: 
> > /var/solr/data/dovecot/data/index/write.lock
> >   at 
> > java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> >   at 
> > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> >   at 
> > java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> >   at 
> > java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> >   at 
> > java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149)
> >   at 
> > java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> >   at java.base/java.nio.file.Files.readAttributes(Files.java:1763)
> >   at 
> > org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:189)
> >   at 
> > org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
> >   at 
> > org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
> >   at 
> > org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.(CompressingStoredFieldsWriter.java:112)
> >   at 
> > org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
> >   at 
> > org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsWriter(Lucene50StoredFieldsFormat.java:183)
> >   at 
> > org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:39)
> >   at 
> > org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:46)
> >   at 
> > org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:367)
> >   at 
> > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
> >   at 
> > org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
> >   at 
> > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
> >   at 
> > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> >   at 
> > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
> >   at 
> > org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:971)
> >   at 
> > org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:344)
> >   at 
> > org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:291)
> >   at 
> > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:238)
> >   at 
> > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
> >   at 
> > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> >   at 
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
> >   at 
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
> >   at 
> > org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
> >   at 
> > org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
> >   at 
> > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
> >   at 
> > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
> >   at 
> > org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> >   at 
> > org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:261)
> >   ... 43 more
> >
> > I then deleted the "data" under the "/var/solr/data/dovecot/

Re: Can't get Solr to work with Dovecot

2020-08-27 Thread Alexandre Rafalovitch
Have you tried blowing the index directory away (usually 'data'
directory next to 'conf'). Because:
cannot change field "box" from index
options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index
options=DOCS

This implies that your field box had different definitions, you
updated it but the index files still remember the old stuff. When you
try to reindex, it fails as it does not know you will delete
everything in the end.

So, delete (backup/move just in case) the whole index and try again.
There may be other errors, but that's the first one.
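
A rough sketch of that, assuming the /var/solr/data/dovecot layout from
later in this thread, and that you stop Solr first:

# stop Solr (or at least unload the core) before touching files
bin/solr stop
# move the index data aside rather than deleting it outright
mv /var/solr/data/dovecot/data /var/solr/data/dovecot/data.bak
# start again; Solr will recreate an empty index
bin/solr start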

Regards,
   Alex.

On Thu, 27 Aug 2020 at 12:58, Francis Augusto Medeiros-Logeay
 wrote:
>
> Hi,
>
> I have - for a long time now - hoped to use an fts engine with dovecot.
> My dovecot version is 2.3.7.2 under Ubuntu 20.04.
>
> I installed solr 7.7.3 and then 8.6.0 to see if this was a
> version-related error. I copied the schema from 7.7.0 as many people
> said this was fine.
>
> I get the following error when trying to reindex a user's mailbox:
>
> doveadm(fran...@francisaugusto.com): Error: fts_solr: Indexing failed:
> 400 Bad Request
> doveadm(fran...@francisaugusto.com): Error: Mailbox INBOX: Transaction
> commit failed: FTS transaction commit failed: backend deinit
> doveadm(fran...@francisaugusto.com): Debug: auth-master: conn
> unix:/var/run/dovecot/auth-userdb: Disconnected: Connection closed
> (fd=10)
>
> On Solr I get this error:
>
> org.apache.solr.common.SolrException: Exception writing document id
> 210/9fd7941e8297d25d9160c3fdd3da/fran...@francisaugusto.com to the
> index; possible analysis error: cannot change field "box" from index
> options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS
>  at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:246)
>
> Parallel to this, I got some log messages on Solr before attempting to
> reindex the user (sorry for the garbadged text:
>
> Time (Local)Level   CoreLogger  Message
> 7/21/2020, 6:38:47 PM   WARN false  x:dovecot   SolrResourceLoader
>   Solr
> loaded a deprecated plugin/analysis class [solr.TrieLongField]. Please
> consult documentation how to replace it accordingly.
> 7/21/2020, 6:38:47 PM   WARN false  x:dovecot   SolrResourceLoader
>   Solr
> loaded a deprecated plugin/analysis class [solr.SynonymFilterFactory].
> Please consult documentation how to replace it accordingly.
> 7/21/2020, 6:38:47 PM   WARN false  x:dovecot   SolrResourceLoader
>   Solr
> loaded a deprecated plugin/analysis class
> [solr.WordDelimiterFilterFactory]. Please consult documentation how to
> replace it accordingly.
> 7/22/2020, 6:43:46 PM   ERROR
> false   x:dovecot   RequestHandlerBase  
> java.lang.IllegalStateException: Type
> mismatch: uid was indexed as SORTED_NUMERIC
> 7/22/2020, 6:43:46 PM   ERROR
> false   x:dovecot   HttpSolrCallnull:java.lang.IllegalStateException: 
> Type
> mismatch: uid was indexed as SORTED_NUMERIC
> 7/22/2020, 6:43:49 PM   ERROR
> false   x:dovecot   RequestHandlerBase  
> java.lang.IllegalStateException: Type
> mismatch: uid was indexed as SORTED_NUMERIC
> 7/22/2020, 6:43:49 PM   ERROR
> false   x:dovecot   HttpSolrCallnull:java.lang.IllegalStateException: 
> Type
> mismatch: uid was indexed as SORTED_NUMERIC
> 7/22/2020, 6:43:56 PM   ERROR
> false   x:dovecot   RequestHandlerBase  
> java.lang.IllegalStateException: Type
> mismatch: uid was indexed as SORTED_NUMERIC
> 7/22/2020, 6:43:56 PM   ERROR
> false   x:dovecot   HttpSolrCallnull:java.lang.IllegalSta
>
> I tried again. I get no errors when doing a `doveadm fts rescan`, but
> get errors when trying this:
>
> doveadm index -u myu...@mydomain.com INBOX
> doveadm(myu...@mydomain.com): Error: fts_solr: Indexing failed: 400 Bad
> Request
> doveadm(myu...@mydomain.com): Error: Mailbox INBOX: Transaction commit
> failed: FTS transaction commit failed: backend deinit
>
> I guess this is a matter of waiting the reindex to be over?
>
> I get so many "Type mismatch" errors in Solr, except for this one that
> looks different and showed up after trying the doveadm index command
> above:
>
>
> ERROR true
> x:dovecot
> RequestHandlerBase
> org.apache.solr.common.SolrException: Exception writing document id
> 210/9fd7941e8297d25d9160c3fdd3da/myu...@mydomain.com to the index;
> possible analysis error: cannot change field "box" from index
> options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS
> org.apache.solr.common.SolrException: Exception writing document id
> 210/9fd7941e8297d25d9160c3fdd3da/myu...@mydomain.com to the index;
> possible analysis error: cannot change field "box" from index
> options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS
>  at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:246)
>  at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
>  at
> 

Re: Solr 8.6.1: Can't round-trip nested document from SolrJ

2020-08-24 Thread Alexandre Rafalovitch
I guess this gets into the point of whether "children" or whatever
field is used for child documents actually needs to be in the schema.
Schemaless mode creates one, but that's not a defining factor. Because
if it needs to be in the schema, then the code should reflect its
cardinality. But if it does not, then all bets are off.

Regards,
   Alex.
P.s. I added this question to SOLR-12298, as I don't think I know
enough about this part of Solr to judge.

On Mon, 24 Aug 2020 at 02:28, Munendra S N  wrote:
>
> >
> > Interestingly, I was forced to add children as an array even when the
> > child was alone and the field was already marked multivalued. It seems
> > the code does not do conversion to multi-value type, which means the
> > query code has to be a lot more careful about checking field return
> > type and having multi-path handling. That's not what Solr does for
> > string class (tested). Is that a known issue?
> >
> > https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java#L88-L89
>
> Not sure about this. Maybe we might need to check in Dev list or Slack
>
>  If I switch commented/uncommented lines around, the retrieval will fail
> > part way through, because one 'children' field is returned as array, but
> > not the other one:
>
> This might be because of these checks
> https://github.com/apache/lucene-solr/blob/e1392c74400d74366982ccb796063ffdcef08047/solr/core/src/java/org/apache/solr/response/transform/ChildDocTransformer.java#L201-L209
> but
> not sure
>
> Regards,
> Munendra S N
>
>
>
> On Sun, Aug 23, 2020 at 7:53 PM Alexandre Rafalovitch 
> wrote:
>
> > Thank you Munendra,
> >
> > That was very helpful. I am looking forward to that documentation Jira
> > to be merged into the next release.
> >
> > I was able to get the example working by switching away from anonymous
> > children to the field approach. Which means hasChildren() call also
> > did not work. It seems the addChildren/hasChildren will need a
> > different schema, without _nest_path_ defined. I did not test.
> >
> > Interestingly, I was forced to add children as an array even when the
> > child was alone and the field was already marked multivalued. It seems
> > the code does not do conversion to multi-value type, which means the
> > query code has to be a lot more careful about checking field return
> > type and having multi-path handling. That's not what Solr does for
> > string class (tested). Is that a known issue?
> >
> > https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java#L88-L89
> >
> > If I switch commented/uncommented lines around, the retrieval will
> > fail part way through, because one 'children' field is returned as
> > array, but not the other one:
> >
> > {responseHeader={status=0,QTime=0,params={q=id:p1,fl=*,[child],wt=javabin,version=2}},response={numFound=1,numFoundExact=true,start=0,docs=[SolrDocument{id=p1,
> > name=[parent1], class=[foo.bar.parent1.1, foo.bar.parent1.2],
> > _version_=1675826293154775040, children=[SolrDocument{id=c1,
> > name=[child1], class=[foo.bar.child1], _version_=1675826293154775040,
> > children=SolrDocument{id=gc1, name=[grandChild1],
> > class=[foo.bar.grandchild1], _version_=1675826293154775040}},
> > SolrDocument{id=c2, name=[child2], class=[foo.bar.child2],
> > _version_=1675826293154775040}]}]}}
> >
> > Regards,
> >Alex.
> >
> > On Sun, 23 Aug 2020 at 01:38, Munendra S N 
> > wrote:
> > >
> > > Hi Alex,
> > >
> > > Currently, fixing the documentation for nested docs is in progress.
> > More
> > > context is available in this JIRA -
> > > https://issues.apache.org/jira/browse/SOLR-14383.
> > >
> > >
> > https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java
> > >
> > > The child doc transformer needs to be specified as part of the fl
> > parameter
> > > like fl=*,[child] so that the descendants are returned for each matching
> > > doc. As the query q=* matches all the documents, they are returned. If
> > only
> > > parent doc needs to be returned with descendants then, we should either
> > use
> > > block join query or query clause which matches only parent doc.
> > >
> > > Another thing I noticed in the code is that the child docs are indexed as
> > > anonymous docs (similar to old syntax) instead of indexing them in the
> > new
> > > syntax. With this, the nested block will be indexed but since the schema
> > > has _nested_path

Re: PDF extraction using Tika

2020-08-24 Thread Alexandre Rafalovitch
The issue seems to be more with a specific file and at the level way
below Solr's or possibly even Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing
not. I would try to narrow down which of the files it is. One way
could be to get a standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably
complain with the same error.
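
For example, something along these lines (the jar version is an assumption;
match it to whatever your Solr bundles, and the file path is hypothetical):

java -jar tika-app-1.19.1.jar --text /path/to/suspect.pdf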

Regards,
   Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for
production. And both will be going away in future Solr versions. You
may have a much less brittle pipeline if you save the structured
outputs from those Tika standalone runs and then index them into Solr,
possibly pre-processed.

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap
 wrote:
>
> Hello,
>
> We are using TikaEntityProcessor to extract the content out of PDF and make 
> the content searchable.
>
> When jetty is run on windows based machine, we are able to successfully load 
> documents using full import DIH(tika entity). Here PDF's is maintained in 
> windows file system.
>
> But when jetty solr is run on linux machine, and try to run DIH, we are 
> getting below exception: (Here PDF's are maintained in linux filesystem)
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> at 
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
> at 
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
> at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
> ... 4 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
> Unable to read content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
> at 
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
> at 
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
> at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
> at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
> at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
> ... 6 more
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF 
> content
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
> at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at 
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
> ... 10 more
> Caused by: java.io.IOException: expected='>' actual='
> ' at offset 2383
> at 
> org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
> at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:226)
> at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:163)
> at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
> at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at 
> 

Re: Solr 8.6.1: Can't round-trip nested document from SolrJ

2020-08-23 Thread Alexandre Rafalovitch
Thank you Munendra,

That was very helpful. I am looking forward to that documentation Jira
to be merged into the next release.

I was able to get the example working by switching away from anonymous
children to the field approach. Which means hasChildren() call also
did not work. It seems the addChildren/hasChildren will need a
different schema, without _nest_path_ defined. I did not test.

Interestingly, I was forced to add children as an array even when the
child was alone and the field was already marked multivalued. It seems
the code does not do conversion to multi-value type, which means the
query code has to be a lot more careful about checking field return
type and having multi-path handling. That's not what Solr does for
string class (tested). Is that a known issue?
https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java#L88-L89

If I switch commented/uncommented lines around, the retrieval will
fail part way through, because one 'children' field is returned as
array, but not the other one:
{responseHeader={status=0,QTime=0,params={q=id:p1,fl=*,[child],wt=javabin,version=2}},response={numFound=1,numFoundExact=true,start=0,docs=[SolrDocument{id=p1,
name=[parent1], class=[foo.bar.parent1.1, foo.bar.parent1.2],
_version_=1675826293154775040, children=[SolrDocument{id=c1,
name=[child1], class=[foo.bar.child1], _version_=1675826293154775040,
children=SolrDocument{id=gc1, name=[grandChild1],
class=[foo.bar.grandchild1], _version_=1675826293154775040}},
SolrDocument{id=c2, name=[child2], class=[foo.bar.child2],
_version_=1675826293154775040}]}]}}

Regards,
   Alex.

On Sun, 23 Aug 2020 at 01:38, Munendra S N  wrote:
>
> Hi Alex,
>
> Currently, fixing the documentation for nested docs is in progress. More
> context is available in this JIRA -
> https://issues.apache.org/jira/browse/SOLR-14383.
>
> https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java
>
> The child doc transformer needs to be specified as part of the fl parameter
> like fl=*,[child] so that the descendants are returned for each matching
> doc. As the query q=* matches all the documents, they are returned. If only
> parent doc needs to be returned with descendants then, we should either use
> block join query or query clause which matches only parent doc.
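
In raw-request form, that looks roughly like this (core name from the
original post assumed):

curl 'http://localhost:8983/solr/solrj/select?q=id:p1&fl=*,[child]'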
>
> Another thing I noticed in the code is that the child docs are indexed as
> anonymous docs (similar to old syntax) instead of indexing them in the new
> syntax. With this, the nested block will be indexed but since the schema
> has _nested_path_ defined [child] doc transformer won't return any docs.
> Anonymous child docs need parentFilter but specifying parentFilter with
> _nested_path_ will lead to error
> It is due to this check -
> https://github.com/apache/lucene-solr/blob/1c8f4c988a07b08f83d85e27e59b43eed5e2ca2a/solr/core/src/java/org/apache/solr/response/transform/ChildDocTransformerFactory.java#L104
>
> Instead of indexing the docs this way,
>
> > SolrInputDocument parent1 = new SolrInputDocument();
> > parent1.addField("id", "p1");
> > parent1.addField("name", "parent1");
> > parent1.addField("class", "foo.bar.parent1");
> >
> > SolrInputDocument child1 = new SolrInputDocument();
> >
> > parent1.addChildDocument(child1);
> > child1.addField("id", "c1");
> > child1.addField("name", "child1");
> > child1.addField("class", "foo.bar.child1");
> >
> >
> modify it to indexing
>
> > SolrInputDocument parent1 = new SolrInputDocument();
> > parent1.addField("id", "p1");
> > parent1.addField("name", "parent1");
> > parent1.addField("class", "foo.bar.parent1");
> >
> > SolrInputDocument child1 = new SolrInputDocument();
> >
> > parent1.addField("sometag", Arrays.asList(child1));
> > child1.addField("id", "c1");
> > child1.addField("name", "child1");
> > child1.addField("class", "foo.bar.child1");
> >
> > I think, once the documentation fixes get merged to master, indexing and
> searching with the nested documents will become much clearer.
>
> Regards,
> Munendra S N
>
>
>
> On Sun, Aug 23, 2020 at 5:18 AM Alexandre Rafalovitch 
> wrote:
>
> > Hello,
> >
> > I am trying to get up to date with both SolrJ and Nested Document
> > implementation and not sure where I am failing with a basic test
> > (
> > https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java
> > ).
> >
> > I am using Solr 8.6.1 with a core created with bin/solr create -c

Solr 8.6.1: Can't round-trip nested document from SolrJ

2020-08-22 Thread Alexandre Rafalovitch
Hello,

I am trying to get up to date with both SolrJ and Nested Document
implementation and am not sure where I am failing with a basic test
(https://github.com/arafalov/SolrJTest/blob/master/src/com/solrstart/solrj/Main.java).

I am using Solr 8.6.1 with a core created with bin/solr create -c
solrj (schemaless is still on).

I then index a nested parent/child/grandchild document and then
query it back. Looking at the debug output, it seems to go out fine as a
nested doc but comes back as 3 individual ones.

Output is:
SolrInputDocument(fields: [id=p1, name=parent1,
class=foo.bar.parent1], children: [SolrInputDocument(fields: [id=c1,
name=child1, class=foo.bar.child1], children:
[SolrInputDocument(fields: [id=gc1, name=grandChild1,
class=foo.bar.grandchild1])])])
{responseHeader={status=0,QTime=1,params={q=*,wt=javabin,version=2}},response={numFound=3,numFoundExact=true,start=0,docs=[SolrDocument{id=gc1,
name=[grandChild1], class=[foo.bar.grandchild1],
_version_=1675769219435724800}, SolrDocument{id=c1, name=[child1],
class=[foo.bar.child1], _version_=1675769219435724800},
SolrDocument{id=p1, name=[parent1], class=[foo.bar.parent1],
_version_=1675769219435724800}]}}
Found 3 documents

Field: 'id' => 'gc1'
Field: 'name' => '[grandChild1]'
Field: 'class' => '[foo.bar.grandchild1]'
Field: '_version_' => '1675769219435724800'
Children: false

Field: 'id' => 'c1'
Field: 'name' => '[child1]'
Field: 'class' => '[foo.bar.child1]'
Field: '_version_' => '1675769219435724800'
Children: false

Field: 'id' => 'p1'
Field: 'name' => '[parent1]'
Field: 'class' => '[foo.bar.parent1]'
Field: '_version_' => '1675769219435724800'
Children: false

Looking in Admin UI:
* _root_ element is there and has 3 instances of 'p1' value
* _nest_path_ (of type _nest_path_ !?!) is also there but is not populated
* _nest_parent_ is not there

I am not quite sure what that means and what other schema modifications
(to the _default_) I need to make to get it to work.

I also tried to reproduce the example in the documentation (e.g.
https://lucene.apache.org/solr/guide/8_6/indexing-nested-documents.html
and  
https://lucene.apache.org/solr/guide/8_6/searching-nested-documents.html#searching-nested-documents)
but both seem to also want some undiscussed schema (e.g. with ID field
instead of id) and fail to execute against default schema.

I am kind of stuck. Does anybody have a working SolrJ/nested example or
ideas about what I missed?

Regards,
   Alex.


Re: Solr ping taking 600 seconds

2020-08-17 Thread Alexandre Rafalovitch
If this is reproducible, I would run Wireshark on the network and see what
happens at packet level.

Leaning towards firewall timing out and just starting to drop all packets.

Regards,
   Alex

On Mon., Aug. 17, 2020, 6:22 p.m. Susheel Kumar, 
wrote:

> Thanks for the all responses.
>
> Shawn - to your point, both ping and select intermittently take 600+ seconds
> to return; as you can see below, the 1st ping attempt was all good and the
> 2nd took a long time. Similarly for select: a couple of selects returned
> fine and then one suddenly took a long time. I'll try to run select with
> shards.info to see if it is a problem with any particular shard, but
> solr.log on many of the shards has QTime>600s entries.
>
> Heap doesn't seems to be a problem but will take a look on all the shards.
> I'll share top output as well.
>
> Thnx
>
>
> Ping
>
> server65:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/admin/ping?distrib=true'
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><bool name="zkConnected">true</bool><int name="status">0</int><int name="QTime">20</int><lst name="params"><str name="q">{!lucene}*:*</str><str name="distrib">true</str><str name="df">wordTokens</str><str name="...">false</str><str name="rows">10</str><str name="echoParams">all</str></lst></lst><str name="status">OK</str>
> </response>
> server65:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/admin/ping?distrib=true'
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><bool name="zkConnected">true</bool><int name="status">0</int><int name="QTime">600123</int><lst name="params"><str name="q">{!lucene}*:*</str><str name="distrib">true</str><str name="df">wordTokens</str><str name="...">false</str><str name="rows">10</str><str name="echoParams">all</str></lst></lst><str name="status">OK</str>
> </response>
>
> select
>
>
> server67:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/select?indent=on&q=*:*&wt=json&rows=0'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":13,
> "params":{
>   "q":"*:*",
>   "indent":"on",
>   "rows":"0",
>   "wt":"json"}},
>   "response":{"numFound":62221186,"start":0,"maxScore":1.0,"docs":[]
>   }}
> server67:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/select?indent=on&q=*:*&wt=json&rows=0'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":10,
> "params":{
>   "q":"*:*",
>   "indent":"on",
>   "rows":"0",
>   "wt":"json"}},
>   "response":{"numFound":62221186,"start":0,"maxScore":1.0,"docs":[]
>   }}
> server67:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/select?indent=on&q=*:*&wt=json&rows=0'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":18,
> "params":{
>   "q":"*:*",
>   "indent":"on",
>   "rows":"0",
>   "wt":"json"}},
>   "response":{"numFound":63094900,"start":0,"maxScore":1.0,"docs":[]
>   }}
> server67:/home/kumar # curl --location --request GET '
> http://server1:8080/solr/COLL/select?indent=on&q=*:*&wt=json&rows=0'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":600093,
> "params":{
>   "q":"*:*",
>   "indent":"on",
>   "rows":"0",
>   "wt":"json"}},
>   "response":{"numFound":62221186,"start":0,"maxScore":1.0,"docs":[]
>   }}
>
> On Sat, Aug 15, 2020 at 1:41 PM Dominique Bejean <
> dominique.bej...@eolya.fr>
> wrote:
>
> > Hi,
> >
> > How long does it take to display the Solr console?
> > What about CPU and iowait with top ?
> >
> > You should start by eliminate network issue between your solr nodes by
> > testing it with netcat on solr port.
> > http://deice.daug.net/netcat_speed.html
> >
> > Dominique
> >
> > Le ven. 14 août 2020 à 23:40, Susheel Kumar  a
> > écrit :
> >
> > > Hello,
> > >
> > >
> > >
> > > One of our Solr 6.6.2 DR clusters (CDCR target), which doesn't even have
> > > any live search load, seems to be taking 600+ seconds many times for the
> > > ping / health check calls. Has anyone seen this before, or a suggestion
> > > for what could be wrong? The collection has 8 shards / 3 replicas and
> > > 64GB memory, and the index seems to fit in memory. Below are the solr.log
> > > entries.
> > >
> > >
> > >
> > >
> > >
> > > solr.log.26:2020-08-13 14:03:20.827 INFO  (qtp1775120226-46486) [c:COLL
> > >
> > > s:shard1 r:core_node19 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > >
> > > [COLL_shard1_replica1]  webapp=/solr path=/admin/ping
> > >
> > > params={distrib=true&_stateVer_=COLL:3032&wt=javabin&version=2}
> > >
> > > hits=62569458 status=0 QTime=600113
> > >
> > > solr.log.26:2020-08-13 14:03:20.827 WARN  (qtp1775120226-46486) [c:COLL
> > >
> > > s:shard1 r:core_node19 x:COLL_shard1_replica1] o.a.s.c.SolrCore slow:
> > >
> > > [COLL_shard1_replica1]  webapp=/solr path=/admin/ping
> > >
> > > params={distrib=true&_stateVer_=COLL:3032&wt=javabin&version=2}
> > >
> > > hits=62569458 status=0 QTime=600113
> > >
> > > solr.log.26:2020-08-13 14:03:20.827 INFO  (qtp1775120226-46486) [c:COLL
> > >
> > > s:shard1 r:core_node19 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > >
> > > [COLL_shard1_replica1]  webapp=/solr path=/admin/ping
> > >
> > > params={distrib=true&_stateVer_=COLL:3032&wt=javabin&version=2}
> status=0
> > >
> > > QTime=600113
> > >
> > > solr.log.26:2020-08-13 14:03:20.827 WARN  (qtp1775120226-46486) [c:COLL
> > >
> > > s:shard1 r:core_node19 x:COLL_shard1_replica1] o.a.s.c.SolrCore slow:
> > >
> > > [COLL_shard1_replica1]  

Re: Multiple "df" fields

2020-08-11 Thread Alexandre Rafalovitch
I can't remember if field aliasing works with df but it may be worth a try:

https://lucene.apache.org/solr/guide/8_1/the-extended-dismax-query-parser.html#field-aliasing-using-per-field-qf-overrides

Another example:
https://github.com/arafalov/solr-indexing-book/blob/master/published/languages/conf/solrconfig.xml
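
A hedged sketch of that aliasing idea (field names hypothetical; as noted
above, whether the alias actually kicks in for df needs testing):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- unfielded terms go against the alias "both" -->
    <str name="df">both</str>
    <str name="f.both.qf">field1 field2</str>
  </lst>
</requestHandler>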

Regards,
Alex

On Tue., Aug. 11, 2020, 9:59 a.m. Edward Turner, 
wrote:

> Hi all,
>
> Is it possible to have multiple "df" fields? (We think the answer is no
> because our experiments did not work when adding multiple "df" values to
> solrconfig.xml -- but we just wanted to double check with those who know
> better.) The reason we would like to do this is that we have two main field
> types (with different analyzers) and we'd like queries without a field to
> be searched over both of them. We could also use copyfields, but this would
> require us to have a common analyzer, which isn't exactly what we want.
>
> An alternative solution is to pre-process the query prior to sending it to
> Solr, so that queries with no field are changed as follows:
>
> q=value -> q=(field1:value OR field2:value)
>
> ... however, we feel a bit uncomfortable doing this though via String
> manipulation.
>
> Is there an obvious way we should tackle this problem that we are missing
> (e.g., which would be cleaner/safer and perhaps works at the Query object
> level)?
>
> Many thanks and best wishes,
>
> Edd
>


Re: wt=xml not defaulting the results to xml format

2020-08-07 Thread Alexandre Rafalovitch
You have echoParams set to all. What does that return?

Regards,
   Alex

On Fri., Aug. 7, 2020, 11:31 a.m. yaswanth kumar, 
wrote:

> Thanks for looking into this Erick,
>
>
> solr/PROXIMITY_DATA_V2/select?q=pkey:223_*=true=country_en=country_en
>
> that's what the url I am hitting, and also I made sure that initParams is
> all commented like this and also I made sure that there is no uncommented
> section defined for initParams.
> 
> Also from the solrcloud I did make sure that I am checking the correct
> collection and verified solrconfig.xml by choosing the collection and
> browsing files within the same collection.
>
> What ever I am trying is not working other than sending wt=xml as a
> parameter while hitting the url.
>
> Thanks,
>
> On Fri, Aug 7, 2020 at 10:31 AM Erick Erickson 
> wrote:
>
> > Please show us the _exact_ URL you’re sending as well as the response
> > header, particularly the echoed params.
> >
> > This is a long shot, but also take a look at any “initParams” sections in
> > solrconfig.xml. The “wt” parameter you’ve specified in your select
> handler
> > should override anything in the defaults section of initParams. But
> > your handler is specifying wt in the defaults section; if your
> > initParams have the json wt specified in an invariants section, that
> > would control.
> >
> > I also recommend you look at your solrconfig through the admin UI, that
> > insures that you’re looking at the same solrconfig that your collection
> is
> > actually using. Then check your collections/ to
> > double check that your collection is using the configset you think it is.
> > This latter assumes SolrCloud.
> >
> > This is likely something in your configurations that is not as you
> expect.
> >
> > Best,
> > Erick
> >
> > > On Aug 7, 2020, at 10:19 AM, yaswanth kumar 
> > wrote:
> > >
> > > Thanks Shawn, for looking into this.
> > >
> > > I did make sure that no explicit parameter wt is being sent and also
> > > verified the logs and even that's not showing up any extra parameters.
> > But
> > > it's always taking json as a default, unless I pass it explicitly as
> > wt=xml
> > > which I don't want to do it here. Is there something else that I need
> to
> > do
> > > ?
> > >
> > > On Fri, Aug 7, 2020 at 4:23 AM Shawn Heisey 
> wrote:
> > >
> > >> How are you sending the query request that doesn't come back as xml? I
> > >> suspect that the request is being sent with an explicit wt parameter
> > set to
> > >> something other than xml. Making a query with the admin ui would do
> > this,
> > >> and it would probably default to json.
> > >>
> > >> When you make a query, assuming you haven't changed the logging
> config,
> > >> every parameter in that request can be found in the log entry for the
> > >> query, including those that come from the solrconfig.xml.
> > >>
> > >> Sorry about the top posted reply. It's the only option on this email
> > app.
> > >> My computer isn't available so I'm on my phone.
> > >>
> > >> On Aug 6, 2020, 21:52, yaswanth kumar <
> > >> On Aug 6, 2020, 21:52, at 21:52, yaswanth kumar <
> yaswanth...@gmail.com>
> > >> wrote:
> > >>> Can someone help me on this ASAP? I am using solr 8.2.0 and below is
> > >>> the
> > >>> snippet from solrconfig.xml for one of the configset, where I am
> trying
> > >>> to
> > >>> default the results into xml format but its giving me as a json
> result.
> > >>>
> > >>> <requestHandler name="/select" class="solr.SearchHandler">
> > >>>   <lst name="defaults">
> > >>>     <str name="echoParams">all</str>
> > >>>     <int name="rows">10</int>
> > >>>     <str name="df">pkey</str>
> > >>>     <str name="wt">xml</str>
> > >>>   </lst>
> > >>>
> > >>> Can some one let me know if I need to do something more to always
> get a
> > >>> solr /select query results as XML??
> > >>> --
> > >>> Thanks & Regards,
> > >>> Yaswanth Kumar Konathala.
> > >>> yaswanth...@gmail.com
> > >>
> > >>
> > >
> > > --
> > > Thanks & Regards,
> > > Yaswanth Kumar Konathala.
> > > yaswanth...@gmail.com
> >
> >
>
> --
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com
>


Re: Multiple fq vs combined fq performance

2020-07-09 Thread Alexandre Rafalovitch
I _think_ it will run all 3 and then do index hopping. But if you know one
fq is super expensive, you could assign it a cost. A cost value over 100
will then try to use a PostFilter, applying that query on top of the
results from the other queries.


https://lucene.apache.org/solr/guide/8_4/common-query-parameters.html#cache-parameter
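
For example (field and value taken from your message; the cost number is a
sketch, and whether it actually runs as a post filter depends on the query
type supporting the PostFilter interface):

fq={!cache=false}_class:taggedTickets
fq={!cache=false}taggedTickets_ticketId:100241
fq={!cache=false cost=200}companyId:22476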

Hope it helps,
Alex.

On Thu., Jul. 9, 2020, 2:05 p.m. Chris Dempsey,  wrote:

> Hi all! In a collection where we have ~54 million documents we've noticed
> running a query with the following:
>
> "fq":["{!cache=false}_class:taggedTickets",
>   "{!cache=false}taggedTickets_ticketId:100241",
>   "{!cache=false}companyId:22476"]
>
> when I debugQuery I see:
>
> "parsed_filter_queries":[
>   "{!cache=false}_class:taggedTickets",
>   "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241
> TO 100241])",
>   "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
> ]
>
> runs in roughly ~450ms but if we remove `{!cache=false}companyId:22476` it
> drops down to ~5ms (it's important to note that `taggedTickets_ticketId` is
> globally unique).
>
> If we change the fqs to:
>
> "fq":["{!cache=false}_class:taggedTickets",
>   "{!cache=false}+companyId:22476 +taggedTickets_ticketId:100241"]
>
> when I debugQuery I see:
>
> "parsed_filter_queries":[
>"{!cache=false}_class:taggedTickets",
>"{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476])
> +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])"
> ]
>
> we get the correct result back in ~5ms.
>
> My current thought is that in the slow scenario Solr is still running
> `{!cache=false}IndexOrDocValuesQuery(companyId:[22476
> TO 22476])` even though it "has the answer" from the first two fq.
>
> Am I off-base or misunderstanding how `fq` are processed?
>


Re: Shingles behavior

2020-05-20 Thread Alexandre Rafalovitch
Did you try it with the 'sow' parameter both ways? I am not sure I fully
understand the question, especially with shingling on both passes
rather than just the indexing one. But at least it is something to try, and
it is one of the areas where Solr and ES differ.
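
For reference, it is just a request parameter, e.g.:

q=mona lisa smile&df=shingle_field&sow=false
q=mona lisa smile&df=shingle_field&sow=true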

Regards,
   Alex.

On Tue, 19 May 2020 at 05:59, Radu Gheorghe  wrote:
>
> Hello Solr users,
>
> I’m quite puzzled about how shingles work. The way tokens are analysed looks 
> fine to me, but the query seems too restrictive.
>
> Here’s the sample use-case. I have three documents:
>
> mona lisa smile
> mona lisa
> mona
>
> I have a shingle filter set up like this (both index- and query-time):
>
> > <filter class="solr.ShingleFilterFactory" ... maxShingleSize="4"/>
>
> When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
> documents back, in that order. Because the first document matches all the 
> terms:
>
> mona
> mona lisa
> mona lisa smile
> lisa
> lisa smile
> smile
>
> And the second one matches only some, and the third document only matches one.
>
> Instead, I only get the first document back. That’s because the query expects 
> all the “words” to match:
>
> > "parsedquery":"+DisjunctionMaxQuery+shingle_field:mona 
> > +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> > +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) 
> > shingle_field:mona lisa smile)))”,
>
> The query above is generated by the Edismax query parser, when I’m using 
> “shingle_field” as “df”.
>
> Is there a way to get “any of the words” to match? I’ve tried all the options 
> I can think of:
> - different query parsers
> - q.OP=OR
> - mm=0 (or 1 or 0% or 10% or…)
>
> Nothing seems to change the parsed query from the above.
>
> I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
> default, and minimum_should_match works as expected. The only difference I 
> see between the two, on the analysis side, is that tokens start at 0 in 
> Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see 
> that the default “text_en”, for example, also starts at position 1.
>
> Is it just a bug that mm doesn’t work in the context of shingles? Or is there 
> a workaround?
>
> Thanks and best regards,
> Radu


Re: What is the logical order of applying sorts in SOLR?

2020-05-20 Thread Alexandre Rafalovitch
If you use sort, you are basically ignoring relevancy (unless you put
that into the sort), which you seem to know, as your example uses fq.

Do you see the performance drop on non-clustered or clustered Solr?
Because I would not be surprised if, for a clustered node, all the
results need to be brought into one place to sort, even if only 10 (of
say 100) would be sent back, whereas without sort each node is asked
for its "top X" matches and the others are never even sent. That would
be my working theory anyway; I am not deep into the multi-shard path
the cluster code takes.

Regards,
   Alex.

On Mon, 11 May 2020 at 15:16, Stephen Lewis Bianamara
 wrote:
>
> Hi SOLR Community,
>
> What is the order of operations which SOLR applies to sorting? I've
> observed many times and across SOLR versions that a restrictive filter with
> a sort takes an extremely long time to return, suggesting to me that the
> SORT is applied before the filter.
>
> An example situation is querying for fq:Foo=Bar vs querying for fq:Foo=Bar
> sort by Id desc. I've observed over many SOLR versions and collections that
> the former is orders of magnitude cheaper and quicker to respond, even when
> the result set is tiny (10-100).
>
> Does anyone in this forum know whether this is the default behavior and
> whether there is any way through the API or SOLR configuration to apply
> sorts after filters?
>
> Thanks,
> Stephen


Re: Large query size in Solr 8.3.0

2020-05-20 Thread Alexandre Rafalovitch
Does this actually work? This individual ID matching feels like a very
fragile attempt at enforcing the sort order and maybe represents an
architectural issue. Maybe you need to do some joins or graph walking
instead. Or, more likely, you would benefit from over-fetching and
just sorting on the ids on the frontend, since you have those IDs
already. You are over-fetching already anyway (rows=250), so you don't
seem to worry that much about payload size.

But, apart from that:
1) Switch from GET to POST
2) 'fl' field list and others after it are probably not very mutable,
this can go into defaults for the request handler (custom one perhaps)
3) You don't seem to use filter queries. But you also have a lot of
binary flags that may benefit from being pushed into 'fq' and improve
caching/minimize score calculations. You could also have not-cached
FQs if you think they will not be reused
4) If you have sets of params that repeat often but not always, you
could do some variable substitutions to loop them in with paramSets
5) Move the sorting query into a boost query, just for clarity of intent (a
combined sketch of 1, 3 and 5 is below)
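
A rough combined sketch of 1), 3) and 5) (collection, field and values here
are placeholders, not your actual query):

curl http://localhost:8983/solr/yourcollection/select \
  --data-urlencode 'q=the main query terms' \
  --data-urlencode 'defType=edismax' \
  --data-urlencode 'fq=status:active' \
  --data-urlencode 'fq={!cache=false}oneOffFlag:true' \
  --data-urlencode 'bq=id:(123 456 789)' \
  --data-urlencode 'rows=250'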

Regards,
  Alex.


On Tue, 19 May 2020 at 10:16, vishal patel
 wrote:
>
>
> Which query parser is used if my query length is large?
> My query is 
> https://drive.google.com/file/d/1P609VQReKM0IBzljvG2PDnyJcfv1P3Dz/view
>
>
> Regards,
> Vishal Patel


Re: Proper way to manage managed-schema file

2020-04-13 Thread Alexandre Rafalovitch
If you are using the API (which the Admin UI does), the regenerated file will
lose comments and sort everything in a particular order. That's just
the implementation at the moment.

If you don't like that, you can always modify the schema file by hand
and reload the core to pick up the changes. You can even set the schema
to be immutable to avoid accidentally modifying it through the API.
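
For the immutable option, the solrconfig.xml bit would look something
like this (double-check the resource name against your own setup):

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">false</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>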

The other option is not to keep comments in that file; then, after the
first rewrite, subsequent ones are quite incremental and make it easy
to track the changes.

Regards,
   Alex.

On Mon, 6 Apr 2020 at 14:11, TK Solr  wrote:
>
> I am using Solr 8.3.1 in non-SolrCloud mode (what should I call this mode?) 
> and
> modifying managed-schema.
>
> I noticed that Solr does override this file wiping out all my comments and
> rearranging the order. I noticed there is a "DO NOT EDIT" comment. Then, what 
> is
> the proper/expected way to manage this file? Admin UI can add fields but 
> cannot
> edit existing one or add new field types. Do I keep a script of many schema
> calls? (Then how do I reset the default to the initial one, which would be
> needed before re-re-playing the schema calls.)
>
> TK
>
>


Re: how to use multiple update process chain?

2020-04-13 Thread Alexandre Rafalovitch
You can only have one chain at a time.

You can, however, create your custom URP chain to contain
configuration from all three.

Or, if you do use multiple chains that are configured similarly, you
can pull each URP into its own definition and then mix and match them
either in the chain or even per request (or in request defaults):
https://lucene.apache.org/solr/guide/8_5/update-request-processors.html#configuring-individual-processors-as-top-level-plugins
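
A rough sketch of that last option (the processor names and classes
here are just examples of the pattern):

<updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory"
                 name="remove-blank"/>

and then pick them per request, e.g.:

/update?processor=uuid,remove-blank&commit=true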

Regards,
   Alex.

On Sat, 11 Apr 2020 at 15:16, derrick cui
 wrote:
>
> Hi,
> I need to do three tasks:
> 1. add-unknown-fields-to-the-schema
> 2. create composite key
> 3. remove duplicate for specified field
> I defined update.chain as below, but only the first one works, the others 
> don't work. please help. thanks
> 
>   
> add-unknown-fields-to-the-schema
> composite-id
> deduplicateTaxonomy
>   
> 
>  default="${update.autoCreateFields:true}"
>  
> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
>
>   
>   
> 
> 
>   
> _gl_collection
> _gl_id
>   
>   
> id
> _gl_id
>   
>   
> _gl_id
> -
>   
>   
>   
> 
> 
>   
> _gl_dp_.*
> _gl_ss_score_.*
>   
>   
>   
> 
> thanks
>


Re: Solr Admin Console hangs on Chrome

2019-12-11 Thread Alexandre Rafalovitch
Check for popup and other tracker blockers. It is possible one of the
resources has a similar name and triggers blocking. There was a thread
in early October with a similar discussion, but apart from the
blockers idea nothing else was discovered at the time.

An easy way would be to create a new Chrome profile without any
add-ons and try accessing Solr that way. This would differentiate
"Chrome vs Firefox" and "Chrome vs Chrome plugins".

Regards,
   Alex.

On Wed, 11 Dec 2019 at 07:50, A Adel  wrote:
>
> Hi - could you provide more details, such as Solr and browser network logs
> when using Chrome / other browsers?
>
> On Tue, Dec 10, 2019 at 5:48 PM Joel Bernstein  wrote:
>
> > Did a recent change to Chrome cause this?
> >
> > In Solr 8x, I'm not seeing slowness with Chrome on Mac.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Tue, Dec 10, 2019 at 8:26 PM SAGAR INGALE 
> > wrote:
> >
> > > I am also facing the same issue for v6.4.0
> > >
> > > On Wed, 11 Dec, 2019, 5:37 AM Joel Bernstein, 
> > wrote:
> > >
> > > > What version of Solr?
> > > >
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > >
> > > > On Tue, Dec 10, 2019 at 5:58 PM Arnold Bronley <
> > arnoldbron...@gmail.com>
> > > > wrote:
> > > >
> > > > > I am also facing similar issue. I have also switched to other
> > browsers
> > > to
> > > > > solve this issue.
> > > > >
> > > > > On Tue, Dec 10, 2019 at 2:22 PM Webster Homer <
> > > > > webster.ho...@milliporesigma.com> wrote:
> > > > >
> > > > > > It seems like the Solr Admin console has become slow when you use
> > it
> > > on
> > > > > > the chrome browser. If I go to the query tab and execute a query,
> > > even
> > > > > the
> > > > > > default *:* after that the browser window becomes very slow.
> > > > > > I'm using chrome Version 78.0.3904.108 (Official Build) (64-bit) on
> > > > > Windows
> > > > > >
> > > > > > The work around is to use Firefox
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> --
> Sent from my iPhone


Re: Search returning unexpected matches at the top

2019-12-06 Thread Alexandre Rafalovitch
You can enable debug output, which will show you what matches and why.
Check the reference guide for the parameters:
https://lucene.apache.org/solr/guide/8_1/common-query-parameters.html#debug-parameter
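
For example, appending debugQuery=true to the failing request (core
name taken from the params below):

/solr/debt/select?q=clt_ref_no:owl\-2924\-8&debugQuery=true

The "explain" section of the response shows exactly how each returned
document scored and why it ranked where it did.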

Regards,
   Alex.

On Fri, 6 Dec 2019 at 11:00, rhys J  wrote:
>
> I have a search box that is just searching every possible core, and every
> possible field.
>
> When I enter 'owl-2924-8', I expect the clt_ref_no of OWL-2924-8 to float
> to the top, however it is the third result in my list.
>
> Here is the code from the search:
>
> on_data({
>   "responseHeader":{
> "status":0,
> "QTime":31,
> "params":{
>   "hl":"true",
>   "indent":"on",
>   "fl":"debt_id, clt_ref_no",
>   "start":"0",
>   "sort":"score desc, id asc",
>   "rows":"500",
>   "version":"2.2",
>   "q":"clt_ref_no:owl\\-2924\\-8 debt_descr:owl\\-2924\\-8
> comments:owl\\-2924\\-8 reference_no:owl\\-2924\\-8 ",
>   "core":"debt",
>   "json.wrf":"on_data",
>   "urlquery":"owl-2924-8",
>   "callback":"?",
>   "wt":"json"}},
>   "response":{"numFound":85675,"start":0,"docs":[
>   {
> "clt_ref_no":"2924",
> "debt_id":"574574"},
>   {
> "clt_ref_no":"2924",
> "debt_id":"598663"},
>   {
> "clt_ref_no":"OWL-2924-8",
> "debt_id":"624401"},
>   {
> "clt_ref_no":"OWL-2924-8",
> "debt_id":"628157"},
>   {
> "clt_ref_no":"2924",
> "debt_id":"584807"},
>   {
> "clt_ref_no":"U615-2924-8",
> "debt_id":"628310"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"596713"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"624401"},
>   {
> "clt_ref_no":"OWL-2924-8/73847",
> "debt_id":"628157"},
>   {
>
> I'm not interested in having a specific search with quotes around it,
> because this is searching everything, so it's a fuzzy search. But I am
> interested in understanding why 'owl-2924-8' doesn't come out on top of the
> search.
>
> As you can see, I'm sorting by score and then id, which should take care of
> things, but it's not.
>
> Thanks,
>
> Rhys


Re: Is it possible to use the Lucene Query Builder? Is there any API to create boolean queries?

2019-12-02 Thread Alexandre Rafalovitch
What about XMLQueryParser:
https://lucene.apache.org/solr/guide/8_2/other-parsers.html#xml-query-parser
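
An untested sketch of what the example might look like there, assuming
the BooleanQuery element honours the minimumNumberShouldMatch attribute
the way Lucene's CoreParser does (UserQuery content is parsed with the
classic query parser, which understands the ~2 fuzzy syntax):

q={!xmlparser}
<BooleanQuery minimumNumberShouldMatch="1">
  <Clause occurs="should">
    <BooleanQuery minimumNumberShouldMatch="2">
      <Clause occurs="should"><UserQuery>f1:term~2</UserQuery></Clause>
      <Clause occurs="should"><UserQuery>f1:second~2</UserQuery></Clause>
      <Clause occurs="should"><UserQuery>f1:another~2</UserQuery></Clause>
    </BooleanQuery>
  </Clause>
  <Clause occurs="should">
    <BooleanQuery minimumNumberShouldMatch="2">
      <Clause occurs="should"><UserQuery>f1:anothert~2</UserQuery></Clause>
      <Clause occurs="should"><UserQuery>f1:anothert2~2</UserQuery></Clause>
      <Clause occurs="should"><UserQuery>f1:anothert3~2</UserQuery></Clause>
    </BooleanQuery>
  </Clause>
</BooleanQuery>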

Regards,
   Alex.

On Wed, 27 Nov 2019 at 22:43,  wrote:
>
> I am trying to simulate the following query(Lucene query builder) using Solr
>
>
>
>
> BooleanQuery.Builder main = new BooleanQuery.Builder();
>
> Term t1 = new Term("f1","term");
> Term t2 = new Term("f1","second");
> Term t3 = new Term("f1","another");
>
> BooleanQuery.Builder q1 = new BooleanQuery.Builder();
> q1.add(new FuzzyQuery(t1,2), BooleanClause.Occur.SHOULD);
> q1.add(new FuzzyQuery(t2,2), BooleanClause.Occur.SHOULD);
> q1.add(new FuzzyQuery(t3,2), BooleanClause.Occur.SHOULD);
> q1.setMinimumNumberShouldMatch(2);
>
> Term t4 = new Term("f1","anothert");
> Term t5 = new Term("f1","anothert2");
> Term t6 = new Term("f1","anothert3");
>
> BooleanQuery.Builder q2 = new BooleanQuery.Builder();
> q2.add(new FuzzyQuery(t4,2), BooleanClause.Occur.SHOULD);
> q2.add(new FuzzyQuery(t5,2), BooleanClause.Occur.SHOULD);
> q2.add(new FuzzyQuery(t6,2), BooleanClause.Occur.SHOULD);
> q2.setMinimumNumberShouldMatch(2);
>
>
> main.add(q1.build(),BooleanClause.Occur.SHOULD);
> main.add(q2.build(),BooleanClause.Occur.SHOULD);
> main.setMinimumNumberShouldMatch(1);
>
> System.out.println(main.build()); // (((f1:term~2 f1:second~2
> f1:another~2)~2) ((f1:anothert~2 f1:anothert2~2 f1:anothert3~2)~2))~1   -->
> Invalid Solr Query
>
>
>
>
>
> In a few words :  ( q1 OR q2 )
>
>
>
> Where q1 and q2 are a set of different terms using I'd like to do a fuzzy
> search but I also need a minimum of terms to match.
>
>
>
> The best I was able to create was something like this  :
>
>
>
> SolrQuery query = new SolrQuery();
> query.set("fl", "term");
> query.set("q", "term~1 term2~2 term3~2");
> query.set("mm",2);
>
> System.out.println(query);
>
>
>
> And I was unable to find any example that would allow me to do the type of
> query that I am trying to build with only one solr query.
>
>
>
> Is it possible to use the Lucene Query builder with Solr? Is there any way
> to create Boolean queries with Solr? Do I need to build the query as a
> String? If so , how do I set the mm parameter in a String query?
>
>
>
> Thank you
>


Re: Prevent Solr overwriting documents

2019-11-27 Thread Alexandre Rafalovitch
Oops. And the link...
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-OptimisticConcurrency

On Wed, Nov 27, 2019, 6:24 PM Alexandre Rafalovitch, 
wrote:

> How about Optimistic Concurrency with _version_ set to negative value?
>
> You could inject that extra value in URP chain if need be.
>
> Regards,
> Alex
>
> On Wed, Nov 27, 2019, 5:41 PM Aaron Hoffer,  wrote:
>
>> We want to prevent Solr from overwriting an existing document if
>> document's
>> ID already exists in the core.
>>
>> This unit test fails because the update/overwrite is permitted:
>>
>> public void testUpdateProhibited() {
>>   final Index index = baseInstance();
>>   indexRepository.save(index);
>>   Index index0 = indexRepository.findById(INDEX_ID).get();
>>   index0.setContents("AAA");
>>   indexRepository.save(index0);
>>   Index index1 = indexRepository.findById(INDEX_ID).get();
>>   assertThat(index, equalTo(index1));
>> }
>>
>>
>> The failure is:
>> Expected: 
>> but: was > ...>
>>
>> What do I need to do prevent the second save from overwriting the existing
>> document?
>>
>> I commented out the updateHandler in the solr config file, to no effect.
>> We are using Spring Data with Solr 8.1.
>> In the core's schema, id is defined as unique  like this:
>> id
>>
>


Re: Prevent Solr overwriting documents

2019-11-27 Thread Alexandre Rafalovitch
How about Optimistic Concurrency with _version_ set to negative value?

You could inject that extra value in URP chain if need be.
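
A minimal sketch (core name and values made up): with _version_ set to
-1, the add only succeeds if no document with that id exists yet;
otherwise Solr returns a 409 version-conflict error instead of
overwriting.

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/mycore/update?commit=true' --data-binary '
[ { "id": "INDEX_ID", "contents": "AAA", "_version_": -1 } ]'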

Regards,
Alex

On Wed, Nov 27, 2019, 5:41 PM Aaron Hoffer,  wrote:

> We want to prevent Solr from overwriting an existing document if document's
> ID already exists in the core.
>
> This unit test fails because the update/overwrite is permitted:
>
> public void testUpdateProhibited() {
>   final Index index = baseInstance();
>   indexRepository.save(index);
>   Index index0 = indexRepository.findById(INDEX_ID).get();
>   index0.setContents("AAA");
>   indexRepository.save(index0);
>   Index index1 = indexRepository.findById(INDEX_ID).get();
>   assertThat(index, equalTo(index1));
> }
>
>
> The failure is:
> Expected: 
> but: was  ...>
>
> What do I need to do prevent the second save from overwriting the existing
> document?
>
> I commented out the updateHandler in the solr config file, to no effect.
> We are using Spring Data with Solr 8.1.
> In the core's schema, id is defined as unique  like this:
> id
>


Re: How to implement NOTIN operator with Solr

2019-11-19 Thread Alexandre Rafalovitch
I think the main question here is whether the compound word "credit
card" is always the same. If yes, you can preprocess it during indexing
into something unique and discard it (see Vincenzo's reply). You could
even copyField and process the copy to leave only the standalone word
"credit" in it, so it basically serves as a boolean presence marker.

But if it can change for every search, you have to handle it at query
time only. I suspect span queries can detect something like this, but
I don't have a reference example. I suspect it would be done either with:
*) Surround Query Parser:
https://lucene.apache.org/solr/guide/8_3/other-parsers.html#surround-query-parser
or directly with
*) XML Query Parser:
https://lucene.apache.org/solr/guide/8_3/other-parsers.html#xml-query-parser

Once you have figured out the syntax, you should be able to substitute
values with variables and perhaps even push the long syntax into a
separate query handler, so you just pass the "yes word" and the "no
phrase" to Solr and have it construct the longer query.
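
For the span route, an untested sketch via the XML Query Parser (field
name assumed): it matches "credit" occurrences that are not themselves
part of the phrase "credit card", so doc2 from the example would still
match via its standalone "credit".

q={!xmlparser}
<SpanNot>
  <Include><SpanTerm fieldName="text">credit</SpanTerm></Include>
  <Exclude>
    <SpanNear slop="0" inOrder="true">
      <SpanTerm fieldName="text">credit</SpanTerm>
      <SpanTerm fieldName="text">card</SpanTerm>
    </SpanNear>
  </Exclude>
</SpanNot>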

Please do let us know when you figure it out. I think other people
have been interested in a similar problem before.

Regards,
   Alex.

On Tue, 19 Nov 2019 at 05:08, Raboah, Avi  wrote:
>
> In that case I got only doc1
>
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
> Sent: Tuesday, November 19, 2019 11:51 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to implement NOTIN operator with Solr
>
> Hi Avi,
> There are span queries, but in this case you don’t need it. It is enough to 
> simply filter out documents that are with “credit card”. Your query can be 
> something like
> +text:credit -text:”credit card”
> If you prefer using boolean operators, you can write it as:
> text:credit AND NOT text: “credit card”
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 19 Nov 2019, at 10:30, Raboah, Avi  wrote:
> >
> > I am trying to find the documents which hit this example:
> >
> > q=text:"credit" NOTIN "credit card"
> >
> > for that query I want to get all the documents which contain the term 
> > "credit" but not as part of the phrase "credit card".
> >
> > so:
> >
> > 1. I don't want to get the documents which include just "credit card".
> >
> > 2. I want to get the documents which include just "credit".
> >
> > 3. I want to get the documents which include "credit" but not as part 
> > of credit card.
> >
> >
> >
> > for example:
> >
> > doc1 text: "I want to buy with my credit in my card"
> >
> > doc2 text: "I want to buy with my credit in my credit card"
> >
> > doc3 text: "I want to buy with my credit card"
> >
> > The documents should be returned:
> >
> > doc1, doc2
> >
> > I can't find nothing about NOTIN operator implementation in SOLR docs.
> >
> >


Re: Full-text search for Solr manual

2019-11-13 Thread Alexandre Rafalovitch
Try: site:lucene.apache.org inurl:8_2 luceneMatchVersion
(8.3 does not work; it seems to not be fully indexed by Google yet)

https://github.com/apache/lucene-solr/search?l=AsciiDoc&q=luceneMatchVersion
(latest development version only).

You can read the rendered documents (without extra processing we do),
right on GitHub:
https://github.com/apache/lucene-solr/blob/branch_8_3/solr/solr-ref-guide/src/blockjoin-faceting.adoc

Regards,
  Alex.

On Wed, 13 Nov 2019 at 17:23, Luke Miller  wrote:
>
> Thanks Alex,
>
>
>
> For your response.
>
>
>
> Unfortunately the Solr source does not ship with the source of the manual.
> (Directory /docs only contains a link to the online manual.)
>
>
>
> Google search with domain limitation does not give any results, as mentioned
> in my initial post. Any other limitation does not filter for a specific
> version. E.g.
> https://www.google.de/search?q=%22Solr%20Ref%20Guide%208.3%22%20site:https://lucene.apache.org/%20luceneMatchVersion
>
>
>
> I ended up downloading the whole documentation manually:
>
> wget --timeout=1 --tries=5 --cut-dirs=3 -mkpnp -nH -P solr-8.3 -e robots=off
> https://lucene.apache.org/solr/guide/8_3/
>
>
>
> And then I have to grep. A plain PDF file would be so much more convenient!
> Of course a Solr-enabled search for the online manual would work as well.
>
>
>
> Thanks,
>
> Julian
>
>
>
>
>
> >Grep on the source of the manual (which ships with Solr source).
>
> >
>
> >Google search with domain or keywords limitations.
>
> >
>
> >Online copy searching is not powered by Solr yet. Yes, we are aware of the
>
> >irony and are discussing it.
>
> >
>
> >Regards,
>
> >Alex.
>
> >
>
> >On Tue, Nov 12, 2019, 1:25 AM Luke Miller wrote:
>
> >
>
> >> Hi,
>
> >>
>
> >>
>
> >>
>
> >> I just noticed that since Solr 8.2 the Apache Solr Reference Guide is not
>
> >> available anymore as PDF.
>
> >>
>
> >>
>
> >>
>
> >> Is there a way to perform a full-text search using the HTML manual? E.g.
>
> >> I'd
>
> >> like to find every hit for "luceneMatchVersion".
>
> >>
>
> >>
>
> >>
>
> >> *   Using the integrated "Page title lookup." does not find anything
> >> ( - sure, it only looks up page titles. )
>
> >> sure, it only looks up page titles. )
>
> >> *   Google does not return anything either searching for:
>
> >> site:https://lucene.apache.org/solr/guide/8_3/ luceneMatchVersion
>
> >>
>
> >>
>
> >>
>
> >> Is there another search method I missed?
>
> >>
>
> >>
>
> >>
>
> >> Thanks.
>
> >>
>
> >>
>
>
>


Re: Full-text search for Solr manual

2019-11-11 Thread Alexandre Rafalovitch
Grep on the source of the manual (which ships with Solr source).

Google search with domain or keywords limitations.

Online copy searching is not powered by Solr yet. Yes, we are aware of the
irony and are discussing it.

Regards,
Alex.

On Tue, Nov 12, 2019, 1:25 AM Luke Miller,  wrote:

> Hi,
>
>
>
> I just noticed that since Solr 8.2 the Apache Solr Reference Guide is not
> available anymore as PDF.
>
>
>
> Is there a way to perform a full-text search using the HTML manual? E.g.
> I'd
> like to find every hit for "luceneMatchVersion".
>
>
>
> *   Using the integrated "Page title lookup." does not find anything
> ( - sure, it only looks up page titles. )
> *   Google does not return anything either searching for:
> site:https://lucene.apache.org/solr/guide/8_3/ luceneMatchVersion
>
>
>
> Is there another search method I missed?
>
>
>
> Thanks.
>
>


Re: Solr missing mandatory uniqueKey field: id or Unknown field

2019-11-10 Thread Alexandre Rafalovitch
You still have a mismatch between what you think the schema is
(uniqueKey=title) and the message about uniqueKey being id. Focus on
that. Try to get the schema FROM Solr instead of looking at the one you
are providing. Or look in the Admin UI at what it shows for the field
"title" and for the field "id".

Regards,
Alex

On Mon, Nov 11, 2019, 2:30 PM Sthitaprajna, 
wrote:

>
> https://stackoverflow.com/questions/58763657/solr-missing-mandatory-uniquekey-field-id-or-unknown-field?noredirect=1#comment103816164_58763657
>
> May be this will help ? I added screenshots.
>
> On Fri, 8 Nov 2019, 22:57 Alexandre Rafalovitch, 
> wrote:
>
> > Something does not make sense, because your schema defines "title" as
> > the uniqueKey field, but your message talks about "id". Are you
> > absolutely sure that the Solr/collection you get an error for is the
> > same Solr where you are checking the schema?
> >
> > Also, do you have a bit more of the error and stack trace. I find
> > "...or Unknown field" to be very puzzling. What are you trying to do
> > when you get this error?
> >
> > Regards,
> >   Alex.
> >
> > On Sat, 9 Nov 2019 at 01:05, Sthitaprajna  >
> > wrote:
> > >
> > > Thanks,
> > >
> > > I did reload after solr configuration upload to zk
> > > Yes i push the config set to zk and i can see all my changes are on
> cloud
> > > I turned off the managed schema
> > > Yes it has, ypu could have seen it if the attachment are available. I
> > have attached again may be it will be available.
> > >
> > > On Fri, 8 Nov 2019, 21:13 Erick Erickson, 
> > wrote:
> > >>
> > >> Attachments are aggressively stripped by the mail server, so I can’t
> > see them.
> > >>
> > >> Possibilities
> > >> - you didn’t reload your core/collection
> > >> - you didn’t push the configset to Zookeeper if using SolrCloud
> > >> - you are using the managed schema, which uses a file called
> > “managed-schema” rather than classic, which uses schema.xml
> > >> - your input doesn’t really have a field “title”.
> > >> - the doc just doesn’t have a field called “title” in it when it’s
> sent
> > to Solr.
> > >>
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> > On Nov 8, 2019, at 4:41 AM, Sthitaprajna <
> > iamonlyforu.frie...@gmail.com> wrote:
> > >> >
> > >> > title
> > >>
> >
>


Re: Mixing query between different parsers

2019-11-10 Thread Alexandre Rafalovitch
Weird. Did you try echoParams=all, just to see what other defaults are
picked up?

It feels like it picks up the default parser and maybe a default "df"
value that points to a non-existent "text" field.

Maybe enable debug too to see what it expands to.
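
For example (parser and field taken from the original query):

/select?q=_query_:"{!edismax qf=text_en}hello"&echoParams=all&debug=query

If df is indeed the culprit, explicitly passing df=text_en (or setting
it in the handler defaults) may make the "undefined field" error go
away.
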
Regards,
Alex

On Sun, Nov 10, 2019, 9:26 PM Kaminski, Adi, 
wrote:

> Hi,
> We are trying to mix the query clause and use different parsers with some
> syntax we've seen in the community.
> The idea to re-use all capabilities of the default query parser, and also
> xmlparser capabilities of position filter
> (which we asked about it previously and got some guidance from Solr
> community experts to use lucene SpanFirst for our use-case).
>
> We've tried to use a query consisting of two query parsers:
> q=_query_:"{!edismax qf=text_en}hello" _query_:"{!xmlparser} fieldName="text_en" end="5"
> boost="1.2">world"
>
> The response was HTTP 400 with message "undefined field text".
>
> Any idea what's wrong in the syntax, or maybe some direction to more
> official syntax of mixing parsers to query in different ways ?
>
> Thanks a lot in advance,
> Adi
>
>
>


Re: Solr missing mandatory uniqueKey field: id or Unknown field

2019-11-08 Thread Alexandre Rafalovitch
Something does not make sense, because your schema defines "title" as
the uniqueKey field, but your message talks about "id". Are you
absolutely sure that the Solr/collection you get an error for is the
same Solr where you are checking the schema?

Also, do you have a bit more of the error and stack trace. I find
"...or Unknown field" to be very puzzling. What are you trying to do
when you get this error?

Regards,
  Alex.

On Sat, 9 Nov 2019 at 01:05, Sthitaprajna  wrote:
>
> Thanks,
>
> I did reload after solr configuration upload to zk
> Yes i push the config set to zk and i can see all my changes are on cloud
> I turned off the managed schema
> Yes it has, ypu could have seen it if the attachment are available. I have 
> attached again may be it will be available.
>
> On Fri, 8 Nov 2019, 21:13 Erick Erickson,  wrote:
>>
>> Attachments are aggressively stripped by the mail server, so I can’t see 
>> them.
>>
>> Possibilities
>> - you didn’t reload your core/collection
>> - you didn’t push the configset to Zookeeper if using SolrCloud
>> - you are using the managed schema, which uses a file called 
>> “managed-schema” rather than classic, which uses schema.xml
>> - your input doesn’t really have a field “title”.
>> - the doc just doesn’t have a field called “title” in it when it’s sent to 
>> Solr.
>>
>>
>> Best,
>> Erick
>>
>> > On Nov 8, 2019, at 4:41 AM, Sthitaprajna  
>> > wrote:
>> >
>> > title
>>


Re: Good Open Source Front End for Solr

2019-11-06 Thread Alexandre Rafalovitch
For what purpose?

Because, for example, Solr is not designed to serve directly to the
browser, just like MySQL is not. So, usually, there is custom
middleware in between.

On the other hand, Solr can serve as a JDBC engine, so you could use
JDBC frontends to explore data. Or as an engine for visualisations. Etc.

And of course, it ships with the Admin UI for internal purposes.

What's your specific use case?

Regards,
Alex

On Thu, Nov 7, 2019, 3:17 PM Java Developer,  wrote:

> Hi,
>
> What is the best open source front-end for Solr
>
> Thanks
>


Re: [Q] Ref Guide - What is Multi-Term Expansion?

2019-11-06 Thread Alexandre Rafalovitch
It is mentioned in the opening paragraph: "Prefix, Wildcard, Regex, etc."

So, if you search for "abc*", it expands to all the terms that start
with "abc". Not everything can handle this situation, as it produces a
lot of terms in the same position. So, not all analysis components can
handle that, and normally the multiterm analyzer is just an
automatically built subset of the safe ones.
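
For illustration, an explicitly defined multiterm analyzer looks like
this (a sketch; the type name is made up):

<fieldType name="text_mt" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard/prefix/regex terms instead of the query analyzer -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this, a query like "Solr*" gets lowercased (but not tokenized or
stemmed) before the prefix expansion, so it still matches terms that
were lowercased at index time.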

I mark them with "(multi)" in my - very out of date, but still useful
- resource: http://www.solr-start.com/info/analyzers/

Regards,
   Alex.

On Wed, 6 Nov 2019 at 21:19, Paras Lehana  wrote:
>
> Hi Community,
>
> In Ref Guide 8.3's Understanding Analyzers subsection *Analysis for
> Multi-Term Expansion*
> ,
> the text talks about multi-term expansion and explicit use of *analyzer
> type="multiterm"*.
>
> I could not understand what exactly is multi-term expansion and what are
> the use cases for using "multiterm". *[Q1]*
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>


Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Alexandre Rafalovitch
I've done some experiments with indexing the RefGuide (from source)
into Solr at: https://github.com/arafalov/solr-refguide-indexing . But
the problem was creating the UI, hosting, etc.

There was also a thought (mine) of either shipping the RefGuide in Solr
with a pre-built index as an example, or even just shipping an index
with links to the live version. Both of these were complicated because
the PDF was throwing the publication schedule off. And also because we
are trying to make the Solr distribution smaller, not bigger. A bit of
a catch-22 there. But maybe now it could be revisited.

Regards,
   Alex.
P.s. A personal offline copy of the Solr RefGuide could certainly be
built from source. And it will become even easier to do that soon. But
yes, perhaps a compressed download of the HTML version would be a nice
replacement for the PDF.

On Tue, 29 Oct 2019 at 09:04, Shawn Heisey  wrote:
>
> On 10/28/2019 3:51 PM, Nicolas Paris wrote:
> > I am not very happy with the search engine embedded within the html
> > documentation I admit. Hope this is not solr under the hood :S
>
> It's not Solr under the hood.  It is done by a javascript library that
> runs in the browser.  It only searches page titles, not the whole document.
>
> The fact that a search engine has terrible search in its documentation
> is not lost on us.  We talked about what it would take to use Solr ...
> the infrastructure that would have to be set up and maintaned is
> prohibitive.
>
> We are looking into improving things in this area.  It's going a lot
> slower than we'd like.
>
> Thanks,
> Shawn


Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
Again, I think you are best off doing it outside of Solr.

But even if you want to get it to work in Solr, I think you start by
getting it to work directly in Tika. Then, get the missing libraries and
configuration into Solr.

Regards,
Alex

On Wed, Oct 23, 2019, 7:08 PM suresh pendap,  wrote:

> Hi Alex,
> Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
> to implement Custom update processor or extend the
> ExtractingRequestProcessor?
>
> Regards
> Suresh
>
> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch  >
> wrote:
>
> > I believe Tika that powers this can do so with extra libraries
> (tesseract?)
> > But Solr does not bundle those extras.
> >
> > In any case, you may want to run Tika externally to avoid the
> > conversion/extraction process be a burden to Solr itself.
> >
> > Regards,
> >  Alex
> >
> > On Wed, Oct 23, 2019, 1:58 PM suresh pendap, 
> > wrote:
> >
> > > Hello,
> > > I am reading the Solr documentation about integration with Tika and
> Solr
> > > Cell framework over here
> > >
> > >
> >
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
> > >
> > > I would like to know if the can Solr Cell framework also be used to
> > extract
> > > text from the image files?
> > >
> > > Regards
> > > Suresh
> > >
> >
>


Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
I believe the Tika that powers this can do so with extra libraries
(Tesseract?), but Solr does not bundle those extras.

In any case, you may want to run Tika externally to avoid the
conversion/extraction process being a burden to Solr itself.
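
A quick way to test that outside of Solr is the tika-app command line
(the jar version here is just an example; OCR requires Tesseract to be
installed and on the PATH):

java -jar tika-app-1.22.jar --text scanned-page.png > scanned-page.txt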

Regards,
 Alex

On Wed, Oct 23, 2019, 1:58 PM suresh pendap,  wrote:

> Hello,
> I am reading the Solr documentation about integration with Tika and Solr
> Cell framework over here
>
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>
> I would like to know if the can Solr Cell framework also be used to extract
> text from the image files?
>
> Regards
> Suresh
>


Re: Importing a csv file encapsulated by " creates a large copyField field of all fields combined.

2019-10-21 Thread Alexandre Rafalovitch
What command do you use to get the file into Solr? My guess is that you
are somehow not hitting the correct handler. Perhaps you are sending
it to the extract handler (designed for PDF, MS Word, etc.) rather than
the correct CSV handler.

Solr comes with examples of how to index CSV files.
See for example:
https://github.com/apache/lucene-solr/blob/master/solr/example/films/README.txt#L39
Also reference documentation:
https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html
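
For reference, a plain CSV upload normally looks something like this
(core name assumed):

curl 'http://localhost:8983/solr/mycore/update?commit=true' \
  -H 'Content-type: text/csv' --data-binary @users.csv

or, with the bundled tool:

bin/post -c mycore -type text/csv users.csv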

Regards,
   Alex.

On Mon, 21 Oct 2019 at 13:04, rhys J  wrote:
>
> I am trying to import a csv file to my solr core.
>
> It looks like this:
>
> "user_id","name","email","client","classification","default_client","disabled","dm_password","manager"
> "A2M","Art Morse","amo...@morsemoving.com","Morse
> Moving","Morse","","X","blue0show",""
> "ABW","Amy Wiedner","amy.wied...@pyramid-logistics.com","Pyramid","","","
> ","shawn",""
> "J2P","Joan Padal","jo...@bergerallied.com","Berger","","","
> ","skew3cues",""
> "ALB","Anna Bachman","an...@bergerallied.com","Berger","","","
> ","wary#scan",""
> "B1B","Bridget Baker","bba...@reliablevan.com","Reliable","","","
> ","laps,hear",""
> "B1K","Bev Klein"," ","Nor-Cal","",""," ","pipe3hour",""
> "B1L","Beverly Leonard","bleon...@reliablevan.com","Reliable","","","
> ","gail6copy",""
> "CMD","Christal Davis","christalda...@smmoving.com","SMMoving","","","
> ","risk-pair",""
> "BEB","Bob Barnum","b...@bergerts.com","Berger","",""," ","mets=pol",""
>
> I have set up the schema via the API, and have all the fields that are
> listed on the top line of the csv file.
>
> When I finish the import, it returns no errors. But when I go to look at
> the schema, it's created a 2 fields in the managed-schema file:
>
>  name="_user_id___name___email___client___classification___default_client___disabled___dm_password___manager_"
> type="text_general"/>
>
> and
>
>   source="_user_id___name___email___client___classification___default_client___disabled___dm_password___manager_"
> dest="_user_id___name___email___client___classification___default_client___disabled___dm_password___manager__str"
> maxChars="256"/>


Re: Solr Paryload example

2019-10-21 Thread Alexandre Rafalovitch
I remember a discussion/blog post from several years ago about a
similar problem. The author went through a lot of thinking and decided
that the best way to deal with it was to have Solr documents represent
a different, more granular level of abstraction.

IIRC, the equivalent for your example would be to represent 'pricing'
(or even pricing/availability) as the document, not a 'product'. You
may need to duplicate product field values for that to work. But that
- apparently - allowed them to represent a lot of concepts (like
upcoming discounts) more easily, by just adding availability dates,
etc. to that final lower-level record. And, of course, it allowed
updating an individual store/product price by changing one record at a
time, without having to invalidate the cache for all the other stores.
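
As an illustration of that modeling (all field names made up), each
Solr document would be one product/store combination:

{ "id":         "prod123_store1",
  "product_id": "prod123",
  "name":       "Blue Widget",
  "store":      "store1",
  "price":      125.0,
  "currency":   "USD",
  "in_stock":   true }

Updating the price for store1 then means reindexing just this one
small document.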

Regards,
   Alex.


On Mon, 21 Oct 2019 at 09:59, Vincenzo D'Amore  wrote:
>
> Hi Erick,
>
> thanks for getting back to me. We started to use payloads because we have
> the classical per-store pricing problem.
> Thousands of stores across and different prices.
> Then we found the payloads very useful started to use it for many reasons,
> like enabling/disabling the product for such store, save the stock
> availability, or save the other info like buy/sell price, discount rates,
> and so on.
> All those information are numbers, but stores can also be in different
> countries, I mean would be useful also have the currency and other
> attributes related to the store.
>
> Thinking about an alternative for payloads maybe I could use the dynamic
> fields, well, I know it is ugly.
>
> Consider this hypothetical case where I have two field payload :
>
> payloadPrice: [
> "store1|125.0",
> "store2|220.0",
> "store3|225.0"
> ]
>
> payloadCurrency: [
> "store1|USD",
> "store2|EUR",
> "store3|GBP"
> ]
>
> with dynamic fields I could have different fields for each document.
>
> currency_store1_s: "USD"
> currency_store2_s: "EUR"
> currency_store3_s: "GBP"
>
> But how many dynamic fields like this can I have? more than thousands?
>
> Again, I've just started to look at solr-ocrhighlighting github project you
> suggested.
> Those seems have written their own payload object type where store ocr
> highlighting information.
> It seems interesting, I'll take a look immediately.
>
> Thanks again for your time.
>
> Best regards,
> Vincenzo
>
>
> On Mon, Oct 21, 2019 at 2:55 PM Erick Erickson 
> wrote:
>
> > This is one of those situations where I know a client did it, but didn’t
> > see the code myself.
> >
> > So I can’t help much.
> >
> > Perhaps a good question at this point, though, is “why do you want to add
> > string payloads anyway”?
> >
> > This isn’t the client, but it might give you some pointers:
> >
> >
> > https://github.com/dbmdz/solr-ocrpayload-plugin/blob/master/src/main/java/de/digitalcollections/solr/plugin/components/ocrhighlighting/OcrHighlighting.java
> >
> > Best,
> > Erick
> >
> > > On Oct 21, 2019, at 6:37 AM, Vincenzo D'Amore 
> > wrote:
> > >
> > > Hi Erick,
> > >
> > > It seems I've reached a dead-point, or at least it seems looking at the
> > > code, it seems I can't  easily add a custom decoder:
> > >
> > > Looking at PayloadUtils class there is getPayloadDecoder method invoked
> > to
> > > return the PayloadDecoder :
> > >
> > >  public static PayloadDecoder getPayloadDecoder(FieldType fieldType) {
> > >PayloadDecoder decoder = null;
> > >
> > >String encoder = getPayloadEncoder(fieldType);
> > >
> > >if ("integer".equals(encoder)) {
> > >  decoder = (BytesRef payload) -> payload == null ? 1 :
> > > PayloadHelper.decodeInt(payload.bytes, payload.offset);
> > >}
> > >if ("float".equals(encoder)) {
> > >  decoder = (BytesRef payload) -> payload == null ? 1 :
> > > PayloadHelper.decodeFloat(payload.bytes, payload.offset);
> > >}
> > >// encoder could be "identity" at this point, in the case of
> > > DelimitedTokenFilterFactory encoder="identity"
> > >
> > >// TODO: support pluggable payload decoders?
> > >
> > >return decoder;
> > >  }
> > >
> > > Any advice to work around this situation?
> > >
> > >
> > > On Mon, Oct 21, 2019 at 1:51 AM Erick Erickson 
> > > wrote:
> > >
> > >> You’d need to write one. Payloads are generally intended to hold
> > numerics
> > >> you can then use in a function query to factor into the score…
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Oct 20, 2019, at 4:57 PM, Vincenzo D'Amore 
> > >> wrote:
> > >>>
> > >>> Sorry, I just realized that I was wrong in how I'm using the payload
> > >>> function.
> > >>> Give that the payload function only handles a numeric (integer or
> > float)
> > >>> payload, could you suggest me an alternative function that handles
> > >> strings?
> > >>> If not, should I write one?
> > >>>
> > >>> On Sun, Oct 20, 2019 at 10:43 PM Vincenzo D'Amore 
> > >>> wrote:
> > >>>
> >  Hi all,
> > 
> >  I'm trying to understand what I did wrong with a payload query that
> >  returns
> > 
> >  

Re: Position search

2019-10-16 Thread Alexandre Rafalovitch
Well, after some digging and trying to recall things:
1) XMLParser allows specifying a query in a different way from normal
query parameters:
https://lucene.apache.org/solr/guide/8_1/other-parsers.html#xml-query-parser
2) SpanFirst allows anchoring the search to the start of the text and
providing the number of initial tokens to search within. It is not well
documented, but apparently somebody did some tests:
https://coding-art.blogspot.com/2016/05/apache-solr-xml-query-parser.html
3) SpanFirst is actually a simpler use case of a more general matcher
(SpanPositionRangeQuery)
4) SpanPositionRangeQuery is not yet exposed in Solr, but will be in
8.3: https://issues.apache.org/jira/browse/SOLR-13663

So, I would test your example with XMLParser and SpanFirst (perhaps on
the latest 8.x Solr). If that works, you have an approach for at least
the "first X words" query, and you know you have an easy upgrade when
8.3 is out (soon). Alternatively, you can play with SpanFirst on a
reversed copy of the field to cover the "last X words" case.
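
An untested sketch of what that test could look like (field name
assumed; "end" counts token positions from the start of the field):

q={!xmlparser}
<SpanFirst end="30">
  <SpanNear slop="10" inOrder="false">
    <SpanTerm fieldName="text">thanks</SpanTerm>
    <SpanTerm fieldName="text">support</SpanTerm>
  </SpanNear>
</SpanFirst>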

Regards,
   Alex.
P.s. Also, SpanFirst apparently boosts matches early in the text
higher than those later. That's in the mailing list archive
discussions, which you can search on the web. E.g.
https://lists.apache.org/thread.html/014db9dcef44a8f9641600d19cfaa528f33bac676b7ac68903537b75@%3Csolr-user.lucene.apache.org%3E

On Wed, 16 Oct 2019 at 08:17, Kaminski, Adi  wrote:
>
> Hi,
> These are really text positions.
> For example I have a document: "hello thanks for calling the support how can 
> I help you"
>
> And in the application I would like to search for documents that match 
> "thanks" NEAR "support" only in first 30 words of the document (greeting part 
> for example), and not in the middle/end part of the document.
>
> Regards,
> Adi
>
> -Original Message-
> From: Alexandre Rafalovitch 
> Sent: Wednesday, October 16, 2019 12:48 PM
> To: solr-user 
> Subject: Re: Position search
>
> So are these really text locations or rather actually sections of the 
> document. If later, can you parse out sections during indexing?
>
> Regards,
>  Alex
>
> On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
> wrote:
>
> > Hi,
> > Thanks for the responses.
> >
> > It's a soft boundary which is resulted by dynamic syntax from our
> > application. So may vary from different user searches, one user can
> > search some "word1" in starting 30 words, and another can search
> > "word2" in starting 10 words. The use case is to match some
> > terms/phrase in specific document places in order to identify 
> > scripts/specific word ocuurences.
> >
> > So I guess copy field won't work here.
> >
> > Any other suggestions/thoughts ?
> > Maybe some hidden position filters in native level to limit from
> > start/end of the document ?
> >
> > Thanks,
> > Adi
> >
> > -Original Message-
> > From: Tim Casey 
> > Sent: Tuesday, October 15, 2019 11:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Position search
> >
> > If this is about a normalized query, I would put the normalization
> > text into a specific field.  The reason for this is you may want to
> > search the overall text during any form of expansion phase of searching for 
> > data.
> > That is, maybe you want to know the context of up to the 120th word.
> > At least you have both.
> > Also, you may want to note which normalized fields were truncated or
> > were simply too small. This would give some guidance as to the bias of
> > the normalization.  If 95% of the fields were not truncated, there is
> > a chance you are not doing good at normalizing because you have a set
> > of particularly short messages.  So I would expect a small set of side
> > fields remarking this.  This would allow you to carry the measures
> > along with the data.
> >
> > tim
> >
> > On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch
> >  > >
> > wrote:
> >
> > > Is the 100 words a hard boundary or a soft one?
> > >
> > > If it is a hard one (always 100 words), the easiest is probably copy
> > > field and in the (unstored) copy, trim off whatever you don't want
> > > to search. Possibly using regular expressions. Of course, "what's a word"
> > > is an important question here.
> > >
> > > Similarly, you could do that with Update Request Processors and
> > > clone/process field even before it hits the schema. Then you could
> > > store the extract for highlighting purposes.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On Tue, 15 Oct 2019 at 02:25, Kamins

Re: Position search

2019-10-16 Thread Alexandre Rafalovitch
So, are these really text locations, or rather actually sections of the
document? If the latter, can you parse out the sections during indexing?

Regards,
 Alex

On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
wrote:

> Hi,
> Thanks for the responses.
>
> It's a soft boundary which is resulted by dynamic syntax from our
> application. So may vary from different user searches, one user can search
> some "word1" in starting 30 words, and another can search "word2" in
> starting 10 words. The use case is to match some terms/phrase in specific
> document places in order to identify scripts/specific word ocuurences.
>
> So I guess copy field won't work here.
>
> Any other suggestions/thoughts ?
> Maybe some hidden position filters in native level to limit from start/end
> of the document ?
>
> Thanks,
> Adi
>
> -Original Message-
> From: Tim Casey 
> Sent: Tuesday, October 15, 2019 11:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Position search
>
> If this is about a normalized query, I would put the normalization text
> into a specific field.  The reason for this is you may want to search the
> overall text during any form of expansion phase of searching for data.
> That is, maybe you want to know the context of up to the 120th word.  At
> least you have both.
> Also, you may want to note which normalized fields were truncated or were
> simply too small. This would give some guidance as to the bias of the
> normalization.  If 95% of the fields were not truncated, there is a chance
> you are not doing good at normalizing because you have a set of
> particularly short messages.  So I would expect a small set of side fields
> remarking this.  This would allow you to carry the measures along with the
> data.
>
> tim
>
> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch  >
> wrote:
>
> > Is the 100 words a hard boundary or a soft one?
> >
> > If it is a hard one (always 100 words), the easiest is probably copy
> > field and in the (unstored) copy, trim off whatever you don't want to
> > search. Possibly using regular expressions. Of course, "what's a word"
> > is an important question here.
> >
> > Similarly, you could do that with Update Request Processors and
> > clone/process field even before it hits the schema. Then you could
> > store the extract for highlighting purposes.
> >
> > Regards,
> >Alex.
> >
> > On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi 
> > wrote:
> > >
> > > Hi,
> > > What's the recommended way to search in Solr (assuming 8.2 is used)
> > > for
> > specific terms/phrases/expressions while limiting the search from
> > position perspective.
> > > For example to search only in the first/last 100 words of the document
> ?
> > >
> > > Is there any built-in functionality for that ?
> > >
> > > Thanks in advance,
> > > Adi
> > >
>


Re: Position search

2019-10-15 Thread Alexandre Rafalovitch
Is the 100 words a hard boundary or a soft one?

If it is a hard one (always 100 words), the easiest is probably a
copyField where, in the (unstored) copy, you trim off whatever you
don't want to search, possibly using regular expressions. Of course,
"what's a word" is an important question here.
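
One concrete way to do the trimming without regular expressions is
LimitTokenCountFilterFactory, which simply stops the token stream after
N tokens (a sketch; the names are made up):

<fieldType name="text_first100" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- keep only the first 100 tokens of the copy -->
    <filter class="solr.LimitTokenCountFilterFactory"
            maxTokenCount="100" consumeAllTokens="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="text" dest="text_first100"/>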

Similarly, you could do that with Update Request Processors and
clone/process the field even before it hits the schema. Then you could
store the extract for highlighting purposes.

Regards,
   Alex.

On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi  wrote:
>
> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for 
> specific terms/phrases/expressions while limiting the search from position 
> perspective.
> For example to search only in the first/last 100 words of the document ?
>
> Is there any built-in functionality for that ?
>
> Thanks in advance,
> Adi
>


Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Alexandre Rafalovitch
Stopwords (this was discussed on the mailing list several times, I
recall): the idea is that removing them used to be one of the tricks to
make the index as small as possible to allow faster search, stopwords
being the most common words.
These days, disk space is not an issue most of the time, and there have
been many optimizations to make stopwords less relevant. Plus, like
you said, sometimes stopword management actively gets in the way.
Here is an interesting, if old, article about it too:
https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be

Regards,
   Alex.

On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Hey Alex,
>
> Thank you!
>
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/8/19, 1:01 PM, "David Hastings"  wrote:
>
> Another thing to add to the above,
> >
> > IT:ibm. In this case, we would want to maintain the colon and the
> > capitalization (otherwise “it” would be taken out as a stopword).
> >
> stopwords are a thing of the past at this point.  there is no benefit to
> using them now with hardware being so cheap.
>
> On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
> wrote:
>
> > If you don't want it to be touched by a tokenizer, how would the
> > protection step know that the sequence of characters you want to
> > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > protect"?
> >
> > What it sounds to me is that you may want to:
> > 1) copyField to a second field
> > 2) Apply a much lighter (whitespace?) tokenizer to that second field
> > 3) Run the results through something like KeepWordFilterFactory
> > 4) Search both fields with a boost on the second, higher-signal field
> >
> > The other option is to run CharacterFilter,
> > (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> > complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > term365". As long as it is done on both indexing and query, they will
> > still match. You may have to have a bunch of them or write some sort
> > of lookup map.
> >
> > Regards,
> >Alex.
> >
> > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hi All,
> > >
> > > This is likely a rudimentary question, but I can’t seem to find a
> > straight-forward answer on forums or the documentation…is there a way to
> > protect tokens from ANY analysis? I know things like the
> > KeywordMarkerFilterFactory protect tokens from stemming, but we have 
> some
> > terms we don’t even want our tokenizer to touch. Mostly, these are
> > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> > maintain the colon and the capitalization (otherwise “it” would be taken
> > out as a stopword).
> > >
> > > Any advice is appreciated!
> > >
> > > Thank you,
> > > Audrey
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> >
>
>


Re: Dataimport: Could not load driver: com.mysql.jdbc.Driver

2019-10-09 Thread Alexandre Rafalovitch
Try referencing the jar directly (by absolute path) with a statement
in solrconfig.xml (and reload the core).
The DIH example shipped with Solr shows how it works.
This will help to see whether the problem is with not finding the jar
or with something else.
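
A sketch of the solrconfig.xml directive (the path and jar name here
are made up):

<lib path="/opt/drivers/neo4j-jdbc-driver-3.4.0.jar"/>

or, to pick up everything matching a pattern in a directory:

<lib dir="/opt/drivers" regex="neo4j-jdbc-.*\.jar"/>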

Regards,
   Alex.

On Wed, 9 Oct 2019 at 09:14, Erick Erickson  wrote:
>
> Try starting Solr with the “-v” option. That will echo all the jars that are 
> loaded and the paths.
>
> Where _exactly_ is the jar file? You say “in the lib folder of my core”, but 
> that leaves a lot of room for interpretation.
>
> Are you running stand-alone or SolrCloud? Exactly how do you start Solr?
>
> Details matter
>
> Best,
> Erick
>
> > On Oct 9, 2019, at 3:07 AM, guptavaibhav35  wrote:
> >
> > Hi,
> > Kindly help me solve the issue when I am connecting NEO4j with solr. I am
> > facing this issue in my log file while I have the jar file of neo4j driver
> > in the lib folder of my core.
> >
> > Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
> > org.apache.solr.handler.dataimport.DataImportHandlerException: Could not
> > load driver: org.neo4j.jdbc.Driver Processing Document # 1
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> >   at
> > org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> >   at
> > org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> >   at
> > org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> >   at java.base/java.lang.Thread.run(Thread.java:835)
> > Caused by: java.lang.RuntimeException:
> > org.apache.solr.handler.dataimport.DataImportHandlerException: Could not
> > load driver: org.neo4j.jdbc.Driver Processing Document # 1
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
> >   ... 4 more
> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> > Could not load driver: org.neo4j.jdbc.Driver Processing Document # 1
> >   at
> > org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
> >   at
> > org.apache.solr.handler.dataimport.JdbcDataSource.createConnectionFactory(JdbcDataSource.java:159)
> >   at
> > org.apache.solr.handler.dataimport.JdbcDataSource.init(JdbcDataSource.java:80)
> >   at
> > org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:397)
> >   at
> > org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:100)
> >   at
> > org.apache.solr.handler.dataimport.SqlEntityProcessor.init(SqlEntityProcessor.java:53)
> >   at
> > org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:77)
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:434)
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
> >   ... 6 more
> > Caused by: java.lang.ClassNotFoundException: Unable to load
> > org.neo4j.jdbc.Driver or
> > org.apache.solr.handler.dataimport.org.neo4j.jdbc.Driver
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:935)
> >   at
> > org.apache.solr.handler.dataimport.JdbcDataSource.createConnectionFactory(JdbcDataSource.java:157)
> >   ... 13 more
> > Caused by: org.apache.solr.common.SolrException: Error loading class
> > 'org.neo4j.jdbc.Driver'
> >   at
> > org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:557)
> >   at
> > org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:488)
> >   at
> > org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:926)
> >   ... 14 more
> > Caused by: java.lang.ClassNotFoundException: org.neo4j.jdbc.Driver
> >   at 
> > java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:436)
> >   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
> >   at
> > java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:864)
> >   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> >   at java.base/java.lang.Class.forName0(Native Method)
> >   at java.base/java.lang.Class.forName(Class.java:415)
> >   at
> > org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:541)
> >   ... 16 more
> >
>
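
A likely fix for this kind of ClassNotFoundException is to put the Neo4j
JDBC driver jar on Solr's classpath, for example via a <lib> directive in
solrconfig.xml. A sketch, where the jar name and directory are assumptions:

<lib dir="${solr.install.dir:../../../..}/lib" regex="neo4j-jdbc-.*\.jar"/>

Dropping the jar into the core's lib/ directory (next to conf/) should also
work, as that directory is picked up automatically when the core loads.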


Re: Protecting Tokens from Any Analysis

2019-10-08 Thread Alexandre Rafalovitch
If you don't want it to be touched by a tokenizer, how would the
protection step know that the sequence of characters you want to
protect is "IT:ibm" and not "this is an IT:ibm term I want to
protect"?

It sounds to me like you may want to (a sketch follows this list):
1) copyField to a second field
2) Apply a much lighter (whitespace?) tokenizer to that second field
3) Run the results through something like KeepWordFilterFactory
4) Search both fields with a boost on the second, higher-signal field
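
A minimal sketch of steps 1-3; the field names, field type name, and
word-list file are all assumptions:

<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="text_acronyms" type="text_acronyms" indexed="true" stored="false"/>
<copyField source="text" dest="text_acronyms"/>

<fieldType name="text_acronyms" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- keep only the protected terms (e.g. IT:ibm), case-sensitive -->
    <filter class="solr.KeepWordFilterFactory" words="protected-terms.txt"
            ignoreCase="false"/>
  </analyzer>
</fieldType>

Step 4 would then be something like qf=text text_acronyms^5 on an
eDisMax handler.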

The other option is to run a CharFilter
(PatternReplaceCharFilterFactory), which runs before the tokenizer, to
map known complex acronyms to non-tokenizable substitutions. E.g.
"IT:ibm" -> "term365". As long as the mapping is applied at both index
and query time, the terms will still match. You may need a bunch of
these rules, or some sort of generated lookup map.
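
A sketch of that analyzer chain; the substitution "term365" is the
example from above, the rest of the chain is an assumption (a single
untyped <analyzer> applies at both index and query time):

<analyzer>
  <!-- runs before tokenization, so the acronym never reaches the tokenizer -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="IT:ibm" replacement="term365"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>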

Regards,
   Alex.

On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Hi All,
>
> This is likely a rudimentary question, but I can’t seem to find a 
> straight-forward answer on forums or the documentation…is there a way to 
> protect tokens from ANY analysis? I know things like the 
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some 
> terms we don’t even want our tokenizer to touch. Mostly, these are 
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to 
> maintain the colon and the capitalization (otherwise “it” would be taken out 
> as a stopword).
>
> Any advice is appreciated!
>
> Thank you,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>


Re: Turn off weighted search

2019-09-30 Thread Alexandre Rafalovitch
Can you give a more detailed example, please? Including the schema bits.

There are a bunch of assumptions in here that are hard to really make
sense of. Solr works with tokens, but you are talking about letter
repetitions. Also, if you want to sort by the string, why not just use
the sort parameter? It is not clear why your use case is harder than that.
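
For reference, plain alphabetical ordering is just a sort on a
single-valued string field; a sketch, where the field name name_s is an
assumption:

q=*:*&sort=name_s asc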

There is also a way to mark the field to not use frequency or length
information (omitNorms, omitTermFreqAndPositions), but I hesitate to
suggest those as you are talking about letters.
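
For completeness, those flags go on the field definition in the schema;
a sketch with assumed names (note that omitting positions also disables
phrase queries against that field):

<field name="title_txt" type="text_general" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true"/>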

Regards,
   Alex.

On Mon, 30 Sep 2019 at 13:31,  wrote:
>
> Hello
> Is it possible to turn off the weighted search for Solr?
> I mean the results have to be presented in a pure alphabetical order, not by 
> the default weighted order. So if a certain letter appears in a word 2 times, 
> this word shouldn't be ranked higher.
> I spent the whole day trying to find the solution on the internet, tried 
> various options with EdgeNGramFilterFactory, but it still doesn't work (the 
> results starting with "a" have to come up first and the results starting with 
> "z" come up last, currently I have "taa" as a first result, just because the 
> letter "a" appears there two times).
> Thanks in advance
> Yuri Gladkov
>


Re: URGENT Documents automatically getting deleted in SOLR 6.6.0

2019-09-26 Thread Alexandre Rafalovitch
Your system is under attack, something trying to hack into it via
Solr. Possibly a cryptominer or similar. And it is using DIH endpoint
for it.

Shawn explained the most likely cause for Solr actually deleting the
records. I would also suggest:
1) Figure out where the requests are coming from and treat them as a
threat. If the source is internal, that machine is infected. If it is
external and persistent, it may need to be blocked, etc.
2) Check that your system has not been infected already by looking for
weird processes. I guess if you are not on Windows, that particular
line is not a threat, but the attack may have had several methods.
3) If you are not using the DataImportHandler, remove it from
solrconfig.xml (a sketch follows below), or rename it (though that will
lose the Admin UI integration), or block access to it at the firewall.
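
The handler registration to remove or comment out in solrconfig.xml
typically looks something like this (the config file name is an
assumption):

<!-- DIH endpoint disabled; delete or leave commented out
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
-->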

Regards,
   Alex.

On Thu, 26 Sep 2019 at 08:42, Neha  wrote:
>
> Hello SOLR Users,
>
> Today I have noticed that in my SOLR 6.6.0 instance documents are
> getting deleted automatically.
>
> In SOLR traces i found below lines and seems it is because of this.
>
>
> 2019-09-26 09:01:21.599 INFO  (qtp225493257-14) [   x:Ecotron]
> o.a.s.c.S.Request [xyz]  webapp=/solr path=/dataimport
> 

Re: Rename field in all documents from `i_itemNumber_l` to `i_itemNumber_cp_l`

2019-09-16 Thread Alexandre Rafalovitch
I don't think you can rename it in the index.

However, you may be able to rename it during the query:
https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-FieldNameAliases

Or, if you use eDisMax, during query rewriting:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
, see example at:
https://github.com/arafalov/solr-indexing-book/blob/master/published/languages/conf/solrconfig.xml#L20-L21

These two may need to play together, actually, if you want it to be fully
transparent to the client code. See #37 in the example above.
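
A minimal sketch of the query-time rename, using the field names from
the question (the per-field qf alias assumes the eDisMax parser):

fl=*,i_itemNumber_cp_l:i_itemNumber_l
defType=edismax
f.i_itemNumber_cp_l.qf=i_itemNumber_l

The fl alias renames the field in returned documents, and the
f.<alias>.qf parameter lets queries referencing i_itemNumber_cp_l match
against the old field.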

Regards,
   Alex.


On Mon, 16 Sep 2019 at 03:54, Sebastian Riemer  wrote:

> Dear mailing list,
>
>
>
> I would like to know:
>
>
>
> Is there some simple way to rename a field in all documents in my solr
> index?
>
>
>
> I am using a dynamic schema definition, and I’ve introduced some new
> copyField-instructions. Those make it necessary to reindex all documents.
> It would help me a great deal to be able to rename a specific field from:
>
>
>
> `i_itemNumber_l` to `i_itemNumber_cp_l`
>
>
>
> I don’t really mind to reindex all documents too, but that takes some time
> and having my (old) documents return NULL as value for the field
> `i_itemNumber_cp_l` is breaking a lot of stuff.
>
>
>
> So if there _*IS*_ a way to rename that field, that would help
> tremendously. Btw. I am using Solr 6.5.1 and I use SolrJ in my
> ApplicationLayer.
>
>
>
> Best regards and as always,
>
>
>
> Thank you so much for any input!
>
>
>
>
>
> Yours,
>
> Sebastian
>

