Re: Meaning of "Index" flag under properties and schema

2021-02-16 Thread Charlie Hull
This list strips attachments, so you'll have to figure out another way to 
show the difference.


Cheers

Charlie

On 16/02/2021 15:16, ufuk yılmaz wrote:


There’s a collection at our customer’s site giving weird exceptions 
when a particular field is involved (I asked another question detailing 
that).


When I inspected it, there’s only one difference between it and dozens 
of other fine-working collections, which is:


A text_general field in all other collections has the above 
configuration (without my artsy paint edits), but only that problematic 
collection has an “Index” flag with indexed, tokenized and stored 
checked. I never saw this “Index” flag before. What does it mean?


Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for 
Windows 10




--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Why Solr questions on stackoverflow get very few views and answers, if at all?

2021-02-12 Thread Charlie Hull
I've answered a few in my time, but my experience is that if you do so 
you then get emailed a whole load more questions, some of which aren't 
even relevant to Solr! Also, quite a few of them are 'here is 3 pages of 
code please debug it for me no I won't tell you the actual error I got'.


This is the best place to come; there's also the IRC channel, the new 
Slack gateway to this list at https://s.apache.org/solr-slack, and a 
#solr channel in our own Relevance Slack at 
http://opensourceconnections.com/slack (as well as many others on 
search & relevance topics).


Solr is 'hot' (but not as hot as Elasticsearch), and search is still a 
niche business overall.


HTH

Cheers

Charlie

On 12/02/2021 10:37, ufuk yılmaz wrote:

Is it because the main place for questions is this mailing list, or somewhere 
else that I don’t know about?

Or is Solr just not as ‘hot’ as some other topics?

Sent from Mail for Windows 10




--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: SOLR upgrade

2021-02-09 Thread Charlie Hull

Hi Lulu,

I'm afraid you're going to have to recognise that Solr 5.2.1 is very 
out of date, and the changes between this version and the current 8.x 
releases are significant. A direct jump is, I think, the only sensible 
option.


Although you could take the current configuration and attempt to upgrade 
it to work with 8.x, I recommend you take the chance to look 
at your whole infrastructure (from data ingestion through to query 
construction) and consider what needs upgrading or redesigning for both 
performance and future-proofing. You shouldn't just attempt a 
lift-and-shift of the current setup - some things just won't work and 
some may lock you into future issues. If you're running at large scale 
(I've talked to some people at the BL before and I know you have some 
huge indexes there!) then a redesign may be necessary for scalability 
reasons (cost and feasibility). You should also consider your skills 
base and how the team can stay up to date with Solr changes and modern 
search practice.


Hope this helps - this is a common situation which I've seen many times 
before, and yours is certainly not the oldest version of Solr I've seen 
running recently!


best

Charlie

On 09/02/2021 01:14, Paul, Lulu wrote:

Hi SOLR team,

Please may I ask for advice regarding upgrading the SOLR version (our project 
currently running on solr-5.2.1) to the latest version?
What are the steps, breaking changes and potential issues ? Could this be done 
as an incremental version upgrade or a direct jump to the newest version?

Much appreciate the advice, Thank you!

Best Wishes
Lulu





--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Solr Slack Workspace

2021-01-19 Thread Charlie Hull

Relevance Slack is open to anyone working on search & relevance - #solr is only 
one of the channels, there's lots more! Hope to see you there.

Cheers

Charlie
https://opensourceconnections.com/slack


On 16/01/2021 02:18, matthew sporleder wrote:

IRC has kind of died off;
https://lucene.apache.org/solr/community.html has a Slack mentioned.
I'm on https://opensourceconnections.com/slack after taking their Solr
training class, and I assume it's mostly open to the Solr community.

On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney
 wrote:

Hi all,

I did some googling and didn't find anything, but is there a Slack
workspace for Solr? I think this could be useful to expand interaction
within the community of Solr users and connect people solving similar
problems.

I'd be happy to get this setup if it does not exist already.

Justin



--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Handling acronyms

2021-01-15 Thread Charlie Hull
I'm wondering if you should be applying these synonyms at index time, not 
search time. It will make your index bigger and you'll have to re-index 
to add new synonyms (as they may apply to old documents), but this could 
be an occasional task, and in the meantime you could use query-time 
synonyms for the new ones.


Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
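Index-time synonym expansion of the kind suggested above is usually configured on the field type's index analyzer. A minimal sketch (the field type name and synonyms file name are illustrative, not from the original thread):

```xml
<!-- Hypothetical field type: expands acronyms only at index time.
     Queries are analyzed without synonyms, so searching by either the
     acronym or the expanded text can match the indexed tokens. -->
<fieldType name="text_acronyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="acronyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- Graph filters must be flattened before indexing -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One design note: because only the index analyzer expands synonyms, adding new acronyms requires a re-index of affected documents, which matches the "occasional task" trade-off described above.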

Cheers

Charlie

On 15/01/2021 09:48, Shaun Campbell wrote:

I have a medical journals search application and I've a list of some 9,000
acronyms like this:

MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
SRN=>SRN Stroke Research Network
IGBP=>IGBP isolated gastric bypass
TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive
sleep apnoea–hypopnoea
SRM=>SRM standardised response mean
SRT=>SRT substrate reduction therapy
SRS=>SRS Sexual Rating Scale
SRU=>SRU stroke rehabilitation unit
T2w=>T2w T2-weighted
Ab-P=>Ab-P Aberdeen participation restriction subscale
MSOA=>MSOA middle-layer super output area
SSA=>SSA site-specific assessment
SSC=>SSC Study Steering Committee
SSB=>SSB short-stretch bandage
SSE=>SSE sum squared error
SSD=>SSD social services department
NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument

I tried to put them in a synonyms file, either just with a comma between,
or with an arrow in between and the acronym repeated on the right like
above, and no matter what I try I'm getting really strange search results.
It's like words in one acronym are matching the same word in another
acronym and then searching with that acronym, which is completely unrelated.

I don't think Solr can handle this, but does anyone know of any crafty
tricks in Solr to handle this situation where I can either search by the
acronym or by the text?

Shaun



--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Charlie Hull
uch higher, but we have reduced it to try to address this issue.


The behavior we see:

Solr is normally using ~3-6GB of heap and we usually have ~20GB of free
memory.  Occasionally, though, solr is not able to free up memory and the
heap usage climbs.  Analyzing the GC logs shows a sharp incline of usage
with the GC (the default CMS) working hard to free memory, but not
accomplishing much.  Eventually, it fills up the heap, maxes out the CPUs,
and never recovers.  We have tried to analyze the logs to see if there
are particular queries causing issues or if there are network issues to
zookeeper, but we haven't been able to find any patterns.  After the issues
start, we often see session timeouts to zookeeper, but it doesn't appear
that they are the cause.



Does anyone have any recommendations on things to try or metrics to look
into or configuration issues I may be overlooking?

Thanks,
Jeremy




--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Improve results/relevance

2020-10-19 Thread Charlie Hull

Hi,

A few strategies you can use:

1. First you need to know why the result has matched. Solr provides 
detailed debug info but it's not easy to interpret. Consider using 
something like www.splainer.io to give you better visibility 
(disclaimer: this is something we maintain; there are other alternatives, 
including a cool Chrome plugin). You can now see where scores are being 
calculated.


2. Next you should read up on how Lucene/Solr edismax scoring works - 
remember it's a 'winner takes all' strategy. Here's a great blog by Doug 
on this 
https://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ 
. Now you should know why your results are being ordered as they are.


3. You've now got lots of options: you should set up some tests (perhaps 
use Quepid? www.quepid.com - disclaimer: yes, that's us too :) to monitor 
what happens as you try each one and to check for side-effects. You could 
boost exact phrase matches - here's one way to do this: 
http://everydaydeveloper.blogspot.com/2012/02/solr-improve-relevancy-by-boosting.html 
- or you could use Querqy, which gives you much more flexibility: 
https://querqy.org/ (check out SMUI too as this is a great way to manage 
Querqy rules).


4. What you're doing is active search tuning for ecommerce, and this 
won't be the first example you'll come across. You should also implement 
a system for tracking these kinds of issues, what you do to fix them and 
the tests carried out: it's analogous to a bug tracker and something we 
call a 'Relevancy Register'. Otherwise you'll end up with a huge pile of 
hacks and will swiftly forget why they were implemented and what problem 
they were trying to solve!


5. We're running a blog series about ecommerce search which you might 
want to follow: 
https://opensourceconnections.com/blog/2020/07/07/meet-pete-the-e-commerce-search-product-manager/
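For point 3 above, phrase boosting with edismax is often the first lever to try for the "iPhone 11" vs "iPhone 11 Pro" case. A hedged sketch of request-handler defaults (field names and boost values are examples only, not from this thread):

```xml
<!-- Illustrative edismax defaults: reward documents where the whole
     query appears as a phrase in the title, so an exact "iphone 11"
     title match can outrank longer partial matches. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^2 description</str>
    <str name="pf">title^10</str>   <!-- whole-query phrase boost -->
    <str name="pf2">title^5</str>   <!-- word-pair phrase boost -->
    <str name="tie">0.1</str>       <!-- soften winner-takes-all scoring -->
  </lst>
</requestHandler>
```

The `tie` parameter is worth noting because it directly addresses the 'winner takes all' behaviour described in point 2: a small non-zero value blends in the scores of the non-winning fields.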


HTH

Charlie

On 17/10/2020 04:51, Jayadevan Maymala wrote:

Hi all,

We have a catalogue of many products, including smart phones. We use the
*edismax* query parser. If someone types in iPhone 11, we get the
correct results, but iPhone 11 Pro is coming before iPhone 11. What options
can be used to improve this?

Regards,
Jayadevan



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr 7.7 - Few Questions

2020-10-05 Thread Charlie Hull
Nested docs would be one approach; result grouping might be another. 
Regarding JOINs, the only way you're going to know is by doing some 
representative testing.


Charlie

On 05/10/2020 05:49, Rahul Goswami wrote:

Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents?). How would join queries perform on a large
index (200 million+ docs)?

Thanks,
Rahul



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:

Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is
just the name for the thing that would appear as one of the results when
you search (analogous to a database record). It's not the same
conceptually as a 'Word document' or a 'PDF document'. If your source
documents are so big, consider how they might be broken into parts, or
whether you really need to index all of them for retrieval purposes, or
what parts of them need to be extracted as text. Thus, the Solr
documents don't necessarily need to be as large as your source documents.

Consider an email size 20kb with ten PDF attachments, each 20MB. You
probably shouldn't push all this data into a single Solr document, but
you *could* index them as 11 separate Solr documents, but with metadata
to indicate that one is an email and ten are PDFs, and a shared ID of
some kind to indicate they're related. Then at query time there are
various ways for you to group these together, so for example if the
query hit one of the PDFs you could show the user the original email,
plus the 9 other attachments, using the shared ID as a key.

HTH,

Charlie

On 02/10/2020 01:53, Rahul Goswami wrote:

Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? eg:
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
Indexing documents of the order of half a GB will definitely come to hurt
your operations, if not now, later (think OOM, extremely slow atomic
updates, long running merges etc.).

- Rahul

On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

We are using Apache Solr 7.7 on Windows platform. The data is synced to
Solr using Solr.Net commit. The data is being synced to SOLR in batches.
The document size is very huge (~0.5GB average) and solr indexing is taking
long time. Total document size is ~200GB. As the solr commit is done as a
part of API, the API calls are failing as document indexing is not
completed.

A single document is five hundred megabytes?  What kind of documents do
you have?  You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.

1.  What is your advise on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.

2.  Because of the search requirements, almost 8 fields are defined as
Text fields.

I can't figure out what you are trying to say with this statement.

3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a
large volume of data?

If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow.  I have no way to predict how much heap you will need.  That will
require experimentation.  I can tell you that 2GB is definitely not enough.

4.  How to set up Solr in production on Windows? Currently it's set
up as a standalone engine and client is requested to take the backup of the
drive. Is there any other better way to do? How to set up for the disaster
recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down
to costs -- a Windows Server license isn't cheap.

That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service.  We only have a service
installer for UNIX-type systems.  Most of the testing for that is done
on Linux.

5.  How to benchmark the system requirements for such a huge data

I do not know what all your needs are, so I have
Re: Solr 7.7 - Few Questions

2020-10-02 Thread Charlie Hull

Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is 
just the name for the thing that would appear as one of the results when 
you search (analogous to a database record). It's not the same 
conceptually as a 'Word document' or a 'PDF document'. If your source 
documents are so big, consider how they might be broken into parts, or 
whether you really need to index all of them for retrieval purposes, or 
what parts of them need to be extracted as text. Thus, the Solr 
documents don't necessarily need to be as large as your source documents.


Consider an email size 20kb with ten PDF attachments, each 20MB. You 
probably shouldn't push all this data into a single Solr document, but 
you *could* index them as 11 separate Solr documents, but with metadata 
to indicate that one is an email and ten are PDFs, and a shared ID of 
some kind to indicate they're related. Then at query time there are 
various ways for you to group these together, so for example if the 
query hit one of the PDFs you could show the user the original email, 
plus the 9 other attachments, using the shared ID as a key.
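The email-plus-attachments layout described above can be sketched in a few lines; the field names (`group_id_s`, `doc_type_s`, `content_txt`) are illustrative, not from the thread:

```python
# Sketch of "one Solr document per part, tied together by a shared ID".
# Each part becomes its own small document instead of one huge one.
def explode_email(email_id, body, attachment_texts):
    """Return one doc for the email body plus one doc per attachment."""
    docs = [{
        "id": f"{email_id}-email",
        "group_id_s": email_id,     # shared key for grouping at query time
        "doc_type_s": "email",
        "content_txt": body,
    }]
    for n, text in enumerate(attachment_texts):
        docs.append({
            "id": f"{email_id}-att-{n}",
            "group_id_s": email_id,
            "doc_type_s": "pdf",
            "content_txt": text,
        })
    return docs

# One email with ten PDF attachments -> 11 separate Solr documents
docs = explode_email("msg42", "email body text",
                     [f"pdf text {i}" for i in range(10)])
```

At query time the parts can be brought back together via the shared field, for example with result grouping (`group=true&group.field=group_id_s`) or the collapsing query parser (`fq={!collapse field=group_id_s}`).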


HTH,

Charlie

On 02/10/2020 01:53, Rahul Goswami wrote:

Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document ? eg:
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
Indexing documents of the order of half a GB will definitely come to hurt
your operations, if not now, later (think OOM, extremely slow atomic
updates, long running merges etc.).

- Rahul



On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

We are using Apache Solr 7.7 on Windows platform. The data is synced to
Solr using Solr.Net commit. The data is being synced to SOLR in batches.
The document size is very huge (~0.5GB average) and solr indexing is taking
long time. Total document size is ~200GB. As the solr commit is done as a
part of API, the API calls are failing as document indexing is not
completed.

A single document is five hundred megabytes?  What kind of documents do
you have?  You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.

1.  What is your advise on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.

2.  Because of the search requirements, almost 8 fields are defined as
Text fields.

I can't figure out what you are trying to say with this statement.

3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a
large volume of data?

If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow.  I have no way to predict how much heap you will need.  That will
require experimentation.  I can tell you that 2GB is definitely not enough.

4.  How to set up Solr in production on Windows? Currently it's set
up as a standalone engine and client is requested to take the backup of the
drive. Is there any other better way to do? How to set up for the disaster
recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down
to costs -- a Windows Server license isn't cheap.

That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service.  We only have a service
installer for UNIX-type systems.  Most of the testing for that is done
on Linux.

5.  How to benchmark the system requirements for such a huge data

I do not know what all your needs are, so I have no way to answer this.
You're going to know a lot more about it than any of us are.

Thanks,
Shawn



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr training

2020-09-21 Thread Charlie Hull

Hi Matthew & all,

Why not? Try the code 'evenearlier' for a further discount! (Oh and we 
extended the earlybird period for another week).


Cheers

Charlie

On 17/09/2020 21:00, matthew sporleder wrote:

Is there a friends-on-the-mailing list discount?  I had a bit of sticker shock!

On Wed, Sep 16, 2020 at 9:38 AM Charlie Hull  wrote:

I do of course mean 'Group Discounts': you don't get a discount for
being in a 'froup' sadly (I wasn't even aware that was a thing!)

Charlie





On 16/09/2020 13:26, Charlie Hull wrote:

Hi all,

We're running our Solr Think Like a Relevance Engineer training 6-9 Oct
- you can find out more & book tickets at
https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/

The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm
CET and is led by Eric Pugh who co-wrote the first book on Solr and is
a Solr Committer. It's suitable for all members of the search team -
search engineers, data scientists, even product owners who want to
know how Solr search can be measured & tuned. Delivered by working
relevance engineers the course features practical exercises and will
give you a great foundation in how to use Solr to build great search.

The early bird discount expires end of this week so do book soon if
you're interested! Froup discounts also available. We're also running
a more advanced course on Learning to Rank a couple of weeks later -
you can find all our training courses and dates at
https://opensourceconnections.com/training/

Cheers

Charlie

--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web:www.o19s.com


--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr training

2020-09-16 Thread Charlie Hull
I do of course mean 'Group Discounts': you don't get a discount for 
being in a 'froup' sadly (I wasn't even aware that was a thing!)


Charlie

On 16/09/2020 13:26, Charlie Hull wrote:


Hi all,

We're running our Solr Think Like a Relevance Engineer training 6-9 Oct 
- you can find out more & book tickets at 
https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/


The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm 
CET and is led by Eric Pugh who co-wrote the first book on Solr and is 
a Solr Committer. It's suitable for all members of the search team - 
search engineers, data scientists, even product owners who want to 
know how Solr search can be measured & tuned. Delivered by working 
relevance engineers the course features practical exercises and will 
give you a great foundation in how to use Solr to build great search.


The early bird discount expires end of this week so do book soon if 
you're interested! Froup discounts also available. We're also running 
a more advanced course on Learning to Rank a couple of weeks later - 
you can find all our training courses and dates at 
https://opensourceconnections.com/training/


Cheers

Charlie

--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web:www.o19s.com



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Solr training

2020-09-16 Thread Charlie Hull

Hi all,

We're running our Solr Think Like a Relevance Engineer training 6-9 Oct - 
you can find out more & book tickets at 
https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/


The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm 
CET and is led by Eric Pugh who co-wrote the first book on Solr and is a 
Solr Committer. It's suitable for all members of the search team - 
search engineers, data scientists, even product owners who want to know 
how Solr search can be measured & tuned. Delivered by working relevance 
engineers the course features practical exercises and will give you a 
great foundation in how to use Solr to build great search.


The early bird discount expires end of this week so do book soon if 
you're interested! Froup discounts also available. We're also running a 
more advanced course on Learning to Rank a couple of weeks later - you 
can find all our training courses and dates at 
https://opensourceconnections.com/training/


Cheers

Charlie

--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: PDF extraction using Tika

2020-08-26 Thread Charlie Hull

Hi Joe,

Tika is pretty amazing at coping with the things people throw at it and 
I know the team behind it have added a very extensive testing framework. 
However, the reality is that malformed, huge or just plain crazy 
documents may cause crashes - PDFs are mad, you can even embed 
Javascript in them I believe, and I've also seen PDFs running to 
thousands of pages. There's *no way* to design out every possible crash, 
and it's far better to design your system to cope if necessary by 
separating the PDF processing from Solr.
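The "separate your PDF processing from Solr" advice above can be sketched as a small extraction wrapper where one malformed PDF fails in isolation instead of taking down the indexer. The `tika-app.jar` path is an assumption for illustration; the crash-isolation wrapper is the point:

```python
import subprocess

def extract_text(path):
    # Illustrative: shell out to a standalone Tika app jar (path assumed).
    out = subprocess.run(
        ["java", "-jar", "tika-app.jar", "--text", path],
        capture_output=True, timeout=300, check=True,
    )
    return out.stdout.decode("utf-8", errors="replace")

def extract_all(paths, extractor=extract_text):
    """Return (docs, failures); one bad PDF no longer stops the batch."""
    docs, failures = [], []
    for path in paths:
        try:
            docs.append({"id": path, "content_txt": extractor(path)})
        except Exception as exc:  # malformed PDFs can throw almost anything
            failures.append((path, repr(exc)))
    return docs, failures
```

Only the successfully extracted `docs` are then sent to Solr, while `failures` can be logged and retried; a Tika crash or timeout never happens inside the Solr JVM.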


Charlie

On 25/08/2020 11:46, Joe Doupnik wrote:
More properly, it would be best to fix Tika and thus not push extra 
complexity onto many, many users. Error handling is one thing; crashes, 
though, ought to be designed out.

    Thanks,
    Joe D.

On 25/08/2020 10:54, Charlie Hull wrote:

On 25/08/2020 06:04, Srinivas Kashyap wrote:

Hi Alexandre,

Yes, these are the same PDF files running on Windows and Linux. 
There are around 30 PDF files and I tried indexing a single file, but 
faced the same error. Is it related to how the PDF is stored on Linux?
Did you try running Tika (the same version as you're using in Solr) 
standalone on the file as Alexandre suggested?


And with regard to DIH and TIKA going away, can you share if any 
program which extracts from PDF and pushes into solr?


https://lucidworks.com/post/indexing-with-solrj/ is one example. You 
should run Tika separately as it's entirely possible for it to fail 
to parse a PDF and crash - and if you're running it in DIH & Solr it 
then brings down everything. Separate your PDF processing from your 
Solr indexing.



Cheers

Charlie



Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way 
below Solr's or possibly even Tika's:

Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
 at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045) 



Are you indexing the same files on Windows and Linux? I am guessing 
not. I would try to narrow down which of the files it is. One way 
could be to get a standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably 
complain with the same error.


Regards,
    Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended 
for production. And both will be going away in future Solr versions. 
You may have a much less brittle pipeline if you save the structured 
outputs from those Tika standalone runs and then index them into 
Solr, possibly pre-processed.


On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:

Hello,

We are using TikaEntityProcessor to extract the content out of PDF 
and make the content searchable.


When jetty is run on windows based machine, we are able to 
successfully load documents using full import DIH(tika entity). 
Here PDF's is maintained in windows file system.


But when jetty solr is run on linux machine, and try to run DIH, we
are getting below exception: (Here PDF's are maintained in linux
filesystem)

Full Import failed:java.lang.RuntimeException: 
java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)

 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)

 ... 4 more
Caused by: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.j

Re: PDF extraction using Tika

2020-08-25 Thread Charlie Hull




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, it's single threaded and 
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.
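Point 2 in the list above can be sketched as a short standalone loop. This is a hedged sketch only: the Mongo cursor, collection names and Solr URL are illustrative assumptions, not known details of Abhijit's setup.

```python
# Sketch of a standalone indexer replacing DIH: pull docs from a source
# (e.g. MongoDB) and POST them to Solr's JSON update endpoint in batches.
import json
import urllib.request

def chunked(iterable, size):
    """Yield lists of up to `size` items - Solr indexes far faster in batches."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def post_batch(solr_url, docs):
    """POST one JSON batch to /update; commits are left to autoCommit settings."""
    req = urllib.request.Request(
        solr_url + "/update",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

# Usage (hypothetical endpoints; documents must be JSON-serialisable):
#   from pymongo import MongoClient
#   cursor = MongoClient()["mydb"]["docs"].find()
#   for batch in chunked(cursor, 1000):
#       post_batch("http://localhost:8983/solr/mycollection", batch)
```

Checking deltas (point 3) then just means filtering the cursor on a last-modified field.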


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit






Re: Querying solr using many QueryParser in one call

2020-07-20 Thread Charlie Hull

Hi,

It's very hard to answer questions like 'how fast/slow might this be' - 
the best way to find out is to try, e.g. to build a prototype that you 
can time. To be useful this prototype should use representative data and 
queries. Once you have this, you can try improving performance with 
strategies like the caching you describe.
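A minimal sketch of such a prototype timing harness follows. The URL, parameters and use of nearest-rank percentiles are assumptions for illustration; QTime is the server-side query time in milliseconds that Solr reports in its response header.

```python
# Time representative queries against a prototype Solr and summarise latency.
import json
import urllib.parse
import urllib.request

def solr_qtime(base_url, params):
    """Run one query and return Solr's reported QTime in milliseconds."""
    url = base_url + "/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["responseHeader"]["QTime"]

def percentile(samples, pct):
    """Nearest-rank percentile over a list of recorded latencies."""
    ordered = sorted(samples)
    rank = max(1, int(round(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]

# Usage (hypothetical): run each representative query many times, then
# compare percentile(times, 50) and percentile(times, 95) across strategies.
```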


Charlie

On 16/07/2020 18:14, harjag...@gmail.com wrote:

Hi All,
Below are questions regarding querying Solr using multiple query parsers in one
call.
We need to search by keyword and also include a few specific documents in the
result. We don't want to use the elevation component, as that would put those
mandatory documents at the top of the result. We would like to mix the
mandatory documents into the organic keyword result set, make sure they take
part in other scoring mechanisms like bq, and also distinguish documents
matched by the keyword lookup from the mandatory docs. We ended up with the
Solr query parameters below to achieve this.

fl=id,title,isTermMatch:exists(query({!type=edismax qf=$qf v=blah})),score
q=({!edismax qf=$qf v=$searchQuery mm=$mm}) OR ({!edismax qf=$qf
v=$docIdQuery mm=0 sow=true})
docIdQuery=5985612 6339445 5357348
searchQuery=blah

Below are my questions:
1. As you can see, we are calling three query parsers in one call - what are
the performance implications of the search?
2. Two of those queries (the one in q and the one in fl) are the same - would
the query result cache help?
3. In general, what are the performance implications of calling multiple query
parsers in a single search?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html






Re: Sitecore 9.3 / Solr 8.1.1 - Zookeeper Issue

2020-07-20 Thread Charlie Hull
rj.impl.Http2SolrClient.request(Http2SolrClient.java:416)


Thanks!

Austin Kimmel
Software Developer
Vail Resorts, Inc.
303-404-1922
akim...@vailresorts.com<mailto:akim...@vailresorts.com>

VAILRESORTS(r)
EXPERIENCE OF A LIFETIME








Re: SOLR Exact phrase search issue

2020-07-15 Thread Charlie Hull

On 14/07/2020 12:48, Erick Erickson wrote:

 This is almost certainly a mismatch between what you think is happening 
and what you’ve actually told Solr to do ;).
That's a great one-line explanation of 90% of the issues people face 
with Solr :-)


Charlie


Best,
Erick


On Jul 14, 2020, at 7:05 AM, Villalba Sans, Raúl  wrote:

Hello,

We have an app that uses SOLR as search engine. We have detected incorrect behavior for which we find no explanation. 
If we perform a search with the phrase "Què t’hi jugues" we do not receive any results, although we know that 
there is a result that contains this phrase. However, if we search for "Què t’hi" or for "t’hi 
jugues" we do find results, including "Què t’hi jugues ". We attach screenshots of the search tool and 
the xml of the results. We would greatly appreciate it if you could lend a hand in trying to find a solution or 
identify the cause of the problem.
  
Search 1 – “Què t’hi jugues”


  
Search 2 – “Què t’hi”

 

Search 3 – “t’hi jugues”


Best regards,
  

  
Raül Villalba Sans

Delivery Centers – Centros de Producción
  
Parque de Gardeny, Edificio 28

25071 Lleida, España
T +34 973 193 580
  







I Became a Solr Committer in 4662 Days. Here’s how you can do it faster!

2020-07-10 Thread Charlie Hull

Hi all,

Thought you might enjoy Eric's blog, it's taken him a while! Some good 
hints here for those of you interested in contributing more to Solr.


https://opensourceconnections.com/blog/2020/07/10/i-became-a-solr-committer-in-4662-days-heres-how-you-can-do-it-faster/

Cheers

Charlie




Re: solr fq with contains not returning any results

2020-06-25 Thread Charlie Hull
It looks like something in your query analyzer chain is turning the 
wildcard operator '*' into the token '_star_' - maybe you need to dig into 
your analyzers, synonym lists etc. and see where this is happening. The 
admin/analysis panel that Erick suggests lets you enter data and see 
what happens once your analyzer chain has processed it - have a go and 
see what happens. Either that or newer Solr displays the debug 
information differently, but I don't have two versions here to compare...
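The admin/analysis panel also has an HTTP counterpart that can be scripted to compare the two versions side by side. A sketch, assuming the default /analysis/field handler and using the field and value from this thread (the core URL is hypothetical):

```python
# Build a URL for Solr's field-analysis handler, which shows the tokens
# each analysis stage emits for a given input value.
import urllib.parse

def analysis_url(core_url, field, value):
    params = urllib.parse.urlencode({
        "analysis.fieldname": field,
        "analysis.fieldvalue": value,
        "wt": "json",
    })
    return core_url + "/analysis/field?" + params

# Usage: fetch this with urllib on both the 5.5 and 8.2 servers and diff
# the token lists to find the stage where '_star_' appears.
url = analysis_url("http://localhost:8983/solr/mycore",
                   "auto_nsallschools", "*bostonschool*")
```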


Charlie

On 24/06/2020 19:18, yaswanth kumar wrote:

Thanks Erick,

I have now added debug=query and found a diff between the old Solr and the new Solr

new solr (8.2) which is not giving results is as follows

"debug":{
 "rawquerystring":"*:*",
 "querystring":"*:*",
 "parsedquery":"MatchAllDocsQuery(*:*)",
 "parsedquery_toString":"*:*",
 "explain":{},
 "QParser":"LuceneQParser",
 "filter_queries":["auto_nsallschools:*bostonschool*"],
 "parsed_filter_queries":["auto_nsallschools:_star_bostonschool_star_"],

Where as solr 5.5 which is getting me the results is as follows

"debug":{
 "rawquerystring":"*:*",
 "querystring":"*:*",
 "parsedquery":"MatchAllDocsQuery(*:*)",
 "parsedquery_toString":"*:*",
 "explain":{},
 "QParser":"LuceneQParser",
 "filter_queries":["auto_nsallschools:*bostonschool*"],
 "parsed_filter_queries":["auto_nsallschools:*bostonschool*"],

I know in schema there are analyzer against this field but not getting on
why its making differences here.

Thanks,

On Wed, Jun 24, 2020 at 9:24 AM Erick Erickson 
wrote:


You need to do several things to track down why.

First, use something (admin UI, terms query, etc) to see
exactly what’s in your index. The admin/analysis screen is useful here.

Second, add debug=query to the query on both machines and
see what the actual parsed query looks like.

Comparing those should give you a clue.

Best,
Erick


On Jun 24, 2020, at 9:20 AM, yaswanth kumar 

wrote:

"nsallschools":["BostonSchool"]

That's how the data is stored against the field.

We have a functionality where we can do "Starts with, Contains, Ends

with";

Also if you look at the above schema we are using






Also the strange part is that it's working fine in Solr 5.5 but not in

Solr

8.2 any thoughts??

Thanks,

On Wed, Jun 24, 2020 at 3:15 AM Jörn Franke 

wrote:

I don’t know your data, but could it be that you tokenize differently ?

Why do you do the wildcard search at all? Maybe a different tokenizing
strategy can bring you more efficient results? Depends on what you need
to achieve of course ...


Am 24.06.2020 um 05:37 schrieb yaswanth kumar :

I am using solr 8.2

And when trying to do fq=auto_nsallschools:*bostonschool*, the data is

not

being returned. But if I do the same in solr 5.5 (which I already have

and

we are in the process of migrating to 8.2) it's returning results.

if I do fq=auto_nsallschools:bostonschool
or
fq=auto_nsallschools:bostonschool* it's returning results, but when I try
with contains like described above or

fq=auto_nsallschools:*bostonschool

(ends with) it's not returning any results.

The field which we are already using is a copy field and multi valued,

am I

doing something wrong? or does 8.2 need some adjustment in the configs?

Here is the schema


[schema XML stripped by the mailing list - the surviving fragments show a
stored="true" multiValued="true" field, a copyField target with
indexed="true" stored="false" multiValued="true", and a fieldType with
index and query analyzer chains]
Thanks,

--
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


--
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com







Re: Not all EML files are indexing during indexing

2020-06-03 Thread Charlie Hull
I think the OP is indexing flat files, not web pages (but otherwise, I 
agree with you that Scrapy is great - I know some of the people behind 
it too and they're a good bunch).


Charlie

On 02/06/2020 16:41, Walter Underwood wrote:

On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:

If it was me I'd probably build a standalone indexer script in Python that did 
the file handling, called out to a separate Tika service for extraction, posted 
to Solr.

I would do the same thing, and I would base that script on Scrapy (https://scrapy.org 
<https://scrapy.org/>). I worked on a Python-based web spider for about ten 
years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)







Re: Not all EML files are indexing during indexing

2020-06-02 Thread Charlie Hull
Ah OK. I haven't used SimplePostTool myself and I note the docs say 
"View this not as a best-practice code example, but as a standalone 
example built with an explicit purpose of not having external jar 
dependencies."


I'm wondering if it's some kind of synchronisation issue between new 
files arriving in the folder and being picked up by your Powershell 
script. Hard to say really without seeing all the code...perhaps take 
out the Tika & Solr parts for now and verify the rest of your code 
really can spot every new or updated file that arrives?


If it was me I'd probably build a standalone indexer script in Python 
that did the file handling, called out to a separate Tika service for 
extraction, posted to Solr.
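One way such a script can reliably spot new and updated files - rather than racing a constantly-changing folder - is to persist each file's mtime between runs and diff against it. A sketch of that idea (an assumed approach, not the actual PowerShell logic):

```python
# Return files whose modification time changed since the last pass,
# plus the new state to persist (e.g. as JSON) for the next run.
import os

def changed_files(root, last_state):
    new_state = {}
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            new_state[path] = mtime
            if last_state.get(path) != mtime:
                changed.append(path)  # new file, or mtime moved since last run
    return changed, new_state
```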


Cheers


Charlie





On 02/06/2020 14:48, Zheng Lin Edwin Yeo wrote:

Hi Charlie,

The main code that is doing the indexing is from Solr's
SimplePostTool, but we have made some modifications to it.

The folder walking is done by a PowerShell script, the extraction of
the content from the .eml files is done by the Tika that comes with Solr, and
the images in the .eml files are handled by the OCR that comes with Solr.

We modified the SimplePostTool code to check whether a file already exists
in the index by running a Solr search query on the ID, so I'm wondering
whether this issue is caused by the PowerShell script or by that query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull  wrote:


Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?

Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there are more than 2 million EML files
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is supposed to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the

indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin









Re: Not all EML files are indexing during indexing

2020-06-01 Thread Charlie Hull

Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any 
code for actually walking a folder, extracting the content from .eml 
files and pushing this data into its index, so I'm guessing you've built 
something external?


Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there are more than 2 million EML files
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is supposed to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin






Haystack is Back! Not just one - but three search conferences

2020-05-20 Thread Charlie Hull

Hi all,

So there's no Haystack in Charlottesville this year - but we've done our 
very best to bring you some of the talks and training we planned online 
- find out more at 
https://opensourceconnections.com/blog/2020/05/18/haystack-is-back-go-virtual-for-relevant-search-talks-workshops-discussions-training/


One part of this is three conferences, Berlin Buzzwords, Haystack and 
MICES, have come together for a week of online talks, workshops, panels 
and discussions. There's lots of great search related content including 
Uwe Schindler on Lucene 9, Doug Turnbull & Trey Grainger on AI-Powered 
Search, Tim Allison of NASA on genetic algorithms, a panel on result 
diversity, a workshop on the opensource ecommerce search ecosystem...do 
check it out at www.berlinbuzzwords.de . I'm running a Lightning Talks 
session too (let me know if you've got a talk).


Cheers

Charlie




Combined virtual conference announced with content on Solr, search & relevance

2020-05-07 Thread Charlie Hull
The teams behind Berlin Buzzwords <https://berlinbuzzwords.de/>, 
Haystack <http://www.haystackconf.com> the search relevance conference, 
and MICES <http://mices.co> the ecommerce search event are happy to 
announce a week of virtual talks, panel discussions, workshops and 
training sessions covering themes of search, scale, store!


To be held between *7th-12th June 2020* , this collaboration will bring 
together the best of the planned sessions from three annual conferences 
postponed or cancelled due to COVID-19 and make them available across 
the world. We aim to support our three communities and to bring them 
together to share knowledge, expertise and experiences. Read more here. 
<https://berlinbuzzwords.de/news/registration-online-event-now-available>


Tickets are on sale now at https://berlinbuzzwords.de/tickets - see you 
there (virtually) we hope.


Cheers

Charlie




Re: Use TopicStream as percolator

2020-05-01 Thread Charlie Hull
Great! I ran Flax, where we created Luwak, up to last year when we 
merged with OSC, so this is great to see.


Did you know we donated Luwak to Lucene recently? 
https://issues.apache.org/jira/browse/LUCENE-8766


It would be great to work this up into a Solr contrib module

Charlie
..
Berlin Buzzwords, MICES and Haystack come together for an awesome merged 
online search conference! Check out www.haystackconf.com for news


On 01/05/2020 09:56, SOLR4189 wrote:

Hi everyone,

I wrote a Solr update processor that wraps the Luwak library and implements
saved searches à la the Elasticsearch percolator.

https://github.com/SOLR4189/solcolator

for anyone who wants to use it.









Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-23 Thread Charlie Hull
If users can upload any PDF, including broken or huge ones, and some 
cause a Tika error, you should decouple Tika from Solr and run it as a 
separate process to extract text before indexing with Solr. Otherwise 
some of what is uploaded *will* break Solr.
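A sketch of that decoupling, assuming a stock Apache Tika server (tika-server) listening on its default port 9998; a Tika crash then costs one HTTP request instead of the Solr JVM:

```python
# Extract plain text from a file via a standalone Tika server, so broken
# or huge PDFs can be caught and skipped before anything reaches Solr.
import urllib.request

def tika_request(data, tika_url="http://localhost:9998/tika"):
    """tika-server expects a PUT of the raw bytes; Accept selects plain text."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT",
        headers={"Accept": "text/plain"})

def extract_text(path, tika_url="http://localhost:9998/tika"):
    with open(path, "rb") as f:
        req = tika_request(f.read(), tika_url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Wrap the call in a try/except and log the failing path, and a zero-byte or corrupt file becomes a log line rather than an indexing failure.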

https://lucidworks.com/post/indexing-with-solrj/ has some good hints.

Cheers

Charlie

On 11/06/2019 15:27, neilb wrote:

Hi, while going through solr logs, I found data import error for certain
documents. Here are details about the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to read content Processing Document # 7866
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must
have > 0 bytes
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I know which document (name with path) is #7866? And how do I
ignore ZeroByteFileException, as the document network share is not under my
control? Users can upload PDFs of any size to it.

Thanks!









Re: solr as a general search engine

2020-04-21 Thread Charlie Hull

Hi Matt,

On 21/04/2020 13:41, matthew sporleder wrote:

Sorry for the vague question and I appreciate the book recommendations
-- I actually think I am mostly confused about suggest vs spellcheck
vs morelikethis as they relate to what I referred to as "expected"
behavior (like from a typed-in search bar).
Suggest - here's some results that might match based on what you've 
typed so far (usually powered by a behind-the-scenes search of the index 
with some restrictions). Note the difference between this and 
autocompletion, which suggests complete search terms from the index 
based on the partial word you've typed so far.
Spellcheck - The word you typed isn't anywhere in the index, so I've 
used an edit distance algorithm to suggest a few words you might have 
meant that are in the index (note this isn't spelling correction as the 
engine doesn't necessarily have the corrected form in its index)
Morelikethis - here's some results that share some characteristics with 
the document you're looking at, e.g. they're indexed by some of the same 
terms
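The "edit distance algorithm" mentioned under spellcheck is typically Levenshtein distance. Solr's spellcheck components use configurable string-distance measures, but a plain Levenshtein implementation shows the idea:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```

A spellchecker then suggests indexed terms whose distance from the typed word is small (usually 1 or 2).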


For reference we have been using solr as search in some form for
almost 10 years and it's always been great in finding things based on
clear keywords, programmatic-type discovery, a nosql/distrtibuted k:v
(actually really really good at this) but has always fallen short
(imho and also our fault, obviously) in the "typed in a search query"
experience.
I'm guessing you're bumping into the problem that most people type very 
little into a search bar, and expect the engine to magically know what 
they meant. It doesn't of course, so it has to suggest some ways for the 
user to tell it more specific information - facets for example, or some 
of the features above.


We are in the midst of re-developing our internal content ranking
system and it has me grasping on how to *really* elevate our game in
terms of giving an excellent human-driven discovery vs our current
behavior of: "here is everything we have that contains those words,
minus ones I took out".


I think you need to look at several angles:

- What defines a 'good' result in your world/for your content?
- Who judges this? How do you record this? Human/clicks/both?
- What Solr features *could* help - and how are you going to test that 
they actually do using the two lines above?


We think that building up this measurement-driven, experimental process 
is absolutely key to improving relevance.


Cheers

Charlie




On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull  wrote:

Hi Matt,

Are you looking for a good, general purpose schema and config for Solr?
Well, there's the problem: you need to define what you mean by general
purpose. Every search application will have its own requirements and
they'll be slightly different to every other application. Yes, there
will be some commonalities too. I guess by "as a human might expect one
to behave" you mean "a bit like how Google works" but unfortunately
Google is a poor example: you won't have Google's money or staff or
platform in your company, nor are you likely to be building a
massive-scale web search engine, so at best you can just take
inspiration from it, not replicate it.

In practice, what a lot of people do is start with an example setup
(perhaps from one of the examples supplied with Solr, e.g.
'techproducts') and adapt it: or they might start with the Solr
configset provided by another framework, e.g. Drupal (yay! Pink
Ponies!). Unfortunately the standard example configsets are littered
with comments that say things like 'Here is how you *could* do XYZ but
please don't actually attempt it this way' and other config sections
that if you un-comment them may just get you into further trouble. It's
grown rather than been built, and to my mind there's a good argument for
starting with an absolutely minimal Solr configset and only adding
things in as you need them and understand them (see
https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html
for some background and a great presentation from Alex Rafalovitch on
the examples).

You're also going to need some background on *why* all these features
should be used, and for that I'd recommend my colleague Doug's book
Relevant Search https://www.manning.com/books/relevant-search - or maybe
our training (quick plug: we're running some online training in a couple
of weeks
https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )

Hope this helps,

Cheers

Charlie

On 20/04/2020 23:43, matthew sporleder wrote:

Is there a comprehensive/big set of tips for making solr into a
search-engine as a human would expect one to behave?  I poked around
in the nutch github for a minute and found this:
https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
   but I was wondering if I was missing a very obvious document
somewhere.

I guess I'm looking for things like:
use suggester here, use spell

Re: solr as a general search engine

2020-04-21 Thread Charlie Hull

Hi Matt,

Are you looking for a good, general purpose schema and config for Solr? 
Well, there's the problem: you need to define what you mean by general 
purpose. Every search application will have its own requirements and 
they'll be slightly different to every other application. Yes, there 
will be some commonalities too. I guess by "as a human might expect one 
to behave" you mean "a bit like how Google works" but unfortunately 
Google is a poor example: you won't have Google's money or staff or 
platform in your company, nor are you likely to be building a 
massive-scale web search engine, so at best you can just take 
inspiration from it, not replicate it.


In practice, what a lot of people do is start with an example setup 
(perhaps from one of the examples supplied with Solr, e.g. 
'techproducts') and adapt it: or they might start with the Solr 
configset provided by another framework, e.g. Drupal (yay! Pink 
Ponies!). Unfortunately the standard example configsets are littered 
with comments that say things like 'Here is how you *could* do XYZ but 
please don't actually attempt it this way' and other config sections 
that if you un-comment them may just get you into further trouble. It's 
grown rather than been built, and to my mind there's a good argument for 
starting with an absolutely minimal Solr configset and only adding 
things in as you need them and understand them (see 
https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html 
for some background and a great presentation from Alex Rafalovitch on 
the examples).


You're also going to need some background on *why* all these features 
should be used, and for that I'd recommend my colleague Doug's book 
Relevant Search https://www.manning.com/books/relevant-search - or maybe 
our training (quick plug: we're running some online training in a couple 
of weeks 
https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )


Hope this helps,

Cheers

Charlie

On 20/04/2020 23:43, matthew sporleder wrote:

Is there a comprehensive/big set of tips for making solr into a
search-engine as a human would expect one to behave?  I poked around
in the nutch github for a minute and found this:
https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
  but I was wondering if I was missing a very obvious document
somewhere.

I guess I'm looking for things like:
use suggester here, use spelling there, use DocValues around here, DIY
pagerank, etc

Thanks,
Matt






Re: Indexing data from multiple data sources

2020-04-20 Thread Charlie Hull
The link you quote is Sematext's mirror of the Apache solr-user mailing 
list. There are others also providing copies of this list. As the cat is 
very much out of the bag your best course of action is to change all the 
logins and passwords that have been leaked and review your security 
procedures.


Cheers

Charlie

On 18/04/2020 13:27, RaviKiran Moola wrote:

Hi,
Greetings of the day!!!

Unfortunately we have enclosed our database source details in the Solr 
community post while sending our queries to solr support as mentioned 
in the below mail.


We find that it has been posted with this link 
https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources


As it is open to the world, we are requesting that you please remove that 
post as soon as possible before it creates any security issues for us.


Your help would be very much appreciated!!!

FYI.
Here I'm attaching the below screenshot




Thanks & Regards,

Ravikiran Moola



*From:* RaviKiran Moola
*Sent:* Friday, April 17, 2020 9:13 PM
*To:* solr-user@lucene.apache.org 
*Subject:* RE: Indexing data from multiple data sources
Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & 
MSSQL) into a single collection. We specified the data source details 
(connection details along with the required fields for both data 
sources) in a single data config file, specified the required field 
details in the managed schema, and are fetching the same columns from 
both data sources by specifying a common "unique key".


We are unable to index the data from the data sources using Solr.

Here I’m attaching the data config file and screenshot.

Data config file:

[data-config XML stripped by the mailing list - the surviving fragments
show two JDBC dataSource entries (one MySQL, one SQL Server, each with a
connection URL and credentials) and the entity/field definitions]



Thanks & Regards,

Ravikiran Moola

+91-9494924492







Re: FW: Solr proximity search highlighting issue

2020-04-02 Thread Charlie Hull
I may be wrong here, but the problem may be that the match was on your 
terms pos1 and pos2 (you don't need the pos3 term to match, due to the 
OR operator) and thus that's what's been highlighted.


There's a hl.q parameter that lets you supply a different query for 
highlighting to the one you're using for searching, perhaps that could 
have a different and more forgiving pattern that made sure all your 
terms were highlighted?
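For example, the request could keep the strict complexphrase query in q while hl.q gets a flat OR of the terms, so every term is eligible for highlighting. hl, hl.q and hl.fl are real Solr parameters; the field name here is hypothetical:

```python
# Separate the matching query (q) from the highlighting query (hl.q).
import urllib.parse

params = {
    "q": '{!complexphrase inOrder=true}"pos1 (pos2 OR pos3)"~30',
    "hl": "true",
    "hl.fl": "content",               # hypothetical highlighted field
    "hl.q": "pos1 OR pos2 OR pos3",   # forgiving query used only for highlighting
}
query_string = urllib.parse.urlencode(params)
```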


Also, the XML didn't come through as this list strips attachments.

Best

Charlie

On 31/03/2020 19:27, Anil Shingala wrote:


Hello Dev Team,

I found some problem in highlighting module. Not all the search terms 
are getting highlighted.


Sample query: q={!complexphrase inOrder=true}"pos1 (pos2 OR pos3)"~30&hl=true


Indexed text: "pos1 pos2 pos3 pos4"

please find attached response xml screen shot from solr.

You can see that only two terms are highlighted, like "<em>pos1</em> 
<em>pos2</em> pos3 pos4"


The behaviour has been the same in the Solr source code for a long time (I 
have checked Solr versions 4 through 7). It occurs when the term positions 
are in order in both the document and the query.


Please let me know your view on this.

Regards,

Anil Shingala

*Knovos*
10521 Rosehaven Street, Suite 300 | Fairfax, VA 22030 (USA)
Office +1 703.226.1505

Main +1 703.226.1500 | +1 877.227.5457

/ashing...@knovos.com/ 
<mailto:ashing...@knovos.com>/_|_//www.knovos.com/ 
<http://www.knovos.com/>


Washington DC | New York | London | Paris | Gandhinagar | Tokyo

/Knovos was formerly also known as Capital Novus or Capital Legal 
Solutions. The information contained in this email message may be 
confidential or legally privileged. If you are not the intended 
recipient, please advise the sender by replying to this email and by 
immediately deleting all copies of this message and any attachments. 
Knovos, LLC is not authorized to practice law./







Re: Solr Instance Migration - Server Access

2020-03-26 Thread Charlie Hull
If you can get the server login details you should be able to copy the 
Solr installation and its configuration. If not, then Solr itself 
doesn't provide any way to get them - it's just a search engine, it's 
not responsible for securing a server in any way.


Charlie

On 26/03/2020 02:13, Landon Cowan wrote:

Hello!  I’m working on a website for a client that was migrated from another 
website development company.  The previous company used Solr to build out the 
site search – but they did not send us the server credentials.  The developers 
who built the tool are no longer with the company – is there a process we 
should follow to secure the credentials?  I worry we may need to rebuild the 
feature from the ground up.








Re: How to get boosted field and values?

2020-03-25 Thread Charlie Hull
Try splainer.io - it parses the Debug output to show in detail how the 
scores are calculated (disclaimer, I work for OSC who created it - but 
it's free & open source of course ).


Charlie

On 23/03/2020 01:26, Taisuke Miyazaki wrote:

The blog looks like it's going to be useful from now on, so I'll take a
look. Thank you.

What I wanted, however, was a way to know what field was boosted as a
result.
But I couldn't find a way to do that, so instead I tried to get the field
and value out of the resulting score by putting a binary bit on the
field/value pair.
It doesn't really matter to me whether you do it additively or
multiplicatively, as it's good to know the field boosted as a result.

Do you see what I mean?


On Fri, Mar 20, 2020 at 18:56 Alessandro Benedetti :


Hi Taisuke,
there are various ways of approaching boosting and scoring in Apache Solr.
First of all you must decide if you are interested in multiplicative or
additive boost.
Multiplicative will multiply the score of your search result by a certain
factor while the additive will just add the factor to the final score.

Using advanced query parsers such as the dismax and edismax you can use the
:
*boost* parameter - multiplicative - takes function in input -

https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-TheboostParameter
*bq*(boost query) - additive -

https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter
*bf*(boost function) - additive -

https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter

This blog post is old but should help :
https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
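A toy illustration of the difference (plain arithmetic, not Solr code; the base score and boost factor are invented numbers):

```python
# How an additive boost (bq / bf) and a multiplicative boost (the
# "boost" parameter) change a base relevance score.
base_score = 2.0
factor = 5.0

additive = base_score + factor        # bq / bf style
multiplicative = base_score * factor  # boost parameter style

print(additive)        # 7.0
print(multiplicative)  # 10.0
```

With a multiplicative boost the effect scales with the original score, so strong matches get pushed further apart; an additive boost shifts every matching document by the same amount.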

Then you can boost fields or even specific query clauses:

  1)

https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqf_QueryFields_Parameter

2) q= features:2^1.0 AND features:3^5.0

1.0 is the default, you are multiplying the score contribution of the term
by 1.0, so no effect.
features:3^5.0 means that the score contribution of a match for the term
'3' in the field 'features' will be multiplied by 5.0 (you can also see
that enabling debug=results

Finally you can force the score contribution of a term to be a constant,
it's not recommended unless you are truly confident you don't need other
types of scoring:
q= features:2^=1.0 AND features:3^=5.0

in this example your document  id: 3 will have a score of 6.0

Not sure if this answers your question, if not feel free to elaborate more.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Thu, 19 Mar 2020 at 11:18, Taisuke Miyazaki 
I'm using Solr 7.5.0.
I want to get boosted field and values per documents.

e.g.
documents:
   id: 1, features: [1]
   id: 2, features: [1,2]
   id: 3, features: [1,2,3]

query:
   bq: features:2^1.0 AND features:3^1.0

I expect results like below.
boosted:
   - id: 2
 - field: features, value: 2
   - id: 3
 - field: features, value: 2
 - field: features, value: 3

I have an idea that set boost score like bit-flag, but it's not good I
think because I must send query twice.

bit-flag:
   bq: features:2^2.0 AND features:3^4.0
   docs:
 - id: 1, score: 1.0(0x001)
 - id: 2, score: 3.0(0x011) # have feature:2(2nd bit is 1)
 - id: 3, score: 7.0(0x111) # have feature:2 and feature:3(2nd and 3rd
bit are 1)
check score value then I can get boosted field.

Is there a better way?
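The bit-flag scheme described above can be sketched in a few lines. The decode_boosted helper and the assumption of a 1.0 base match score are illustrative only, not an existing Solr API; the field/value pairs are the ones from the example query:

```python
# Each boost clause gets a distinct power-of-two weight (features:2^2.0,
# features:3^4.0), so the returned score encodes which clauses matched.
FLAGS = {1: ("features", "2"), 2: ("features", "3")}  # bit index -> clause

def decode_boosted(score, base=1.0):
    """Return the (field, value) pairs whose boost bits are set."""
    bits = int(score - base)  # subtract the base match score
    return [clause for i, clause in FLAGS.items() if bits & (1 << i)]

# id:3 scored 7.0 -> bits 0b110 -> both features:2 and features:3 matched
print(decode_boosted(7.0))
```

This keeps it to a single query, at the cost of constraining scoring to constant boosts.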






Re: FW: SOLR version 8 bug???

2020-03-24 Thread Charlie Hull
achedChain.doFilter(ServletHandler.java:1596)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat 
org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat 
org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat 
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)\n\tat 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)\n\tat 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)\n\tat java.lang.Thread.run(Thread.java:748)\n", 
"code":500}} in Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend->search() (line 1600 of 
/srv/www/dcfinternet/phil/modules/composer/search_api_solr/src/Plugin/search_api/backend/SearchApiSolrBackend.php).







Haystack US tickets on sale!

2020-02-27 Thread Charlie Hull

Hi all,

Very happy to announce that Haystack US 2020, the search relevance 
conference, is now open for business! www.haystackconf.com for details 
of the event running during the week of April 27th in Charlottesville, 
including associated training. We have a fantastic lineup of speakers 
due to be published soon, there will be fun social events, book signings 
and more. Earlybird discounts are active until the end of March.


(If you can't wait that long we're also running some Solr training in 
March in London 
https://www.eventbrite.co.uk/e/think-like-a-relevance-engineer-solr-march-2020-london-uk-tickets-92942813457 
and holding our London Solr Meetup that same week 
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/)


Cheers

Charlie




Re: Mongolian language in Solr

2020-02-13 Thread Charlie Hull

Hi,

There's no Mongolian stemmer in Snowball, the stemmer project Lucene 
uses. I found one paper discussing how one might lemmatize Mongolian:

https://www.researchgate.net/publication/220229332_A_lemmatization_method_for_Mongolian_and_its_application_to_indexing_for_information_retrieval
https://dl.acm.org/doi/10.1016/j.ipm.2009.01.008
but no actual code. Of course, you could use Snowball to build your own 
stemmer. https://snowballstem.org/


I did have more success finding Mongolian stopwords 
https://github.com/elastic/elasticsearch/issues/40434 - someone over in 
Elasticsearch land seems to have the same problem as you do.


Best

Charlie

On 12/02/2020 11:41, Samir Joshi wrote:

Hi,

Is it possible to get Mongolian language support in Solr indexing?

Regards,

Samir Joshi

VFS GLOBAL
EST. 2001 | Partnering Governments. Providing Solutions.

10th Floor, Tower A, Urmi Estate, 95, Ganpatrao Kadam Marg, Lower Parel (W), 
Mumbai 400 013, India
Mob: +91 9987550070 | sami...@vfsglobal.com<mailto:sami...@vfsglobal.com> | 
www.vfsglobal.com<http://www.vfsglobal.com/>



--
Care4Green: Please consider the environment before printing this e-mail
--
This message contains information that may be privileged or confidential and is 
the property of the VFS Global Group. It is intended only for the person to 
whom it is addressed. Any unauthorised printing, copying, disclosure, 
distribution or use of this message or any part thereof is strictly forbidden. 
If you are not the intended recipient, you are not authorised to read, print, 
retain, copy, disseminate, distribute, or use this message or any part thereof. 
If you receive this message in error, please notify the sender immediately and 
delete all copies of this message. VFS Global Group has taken reasonable 
precaution to ensure that any attachment to this e-mail has been swept for 
viruses. However, we do not accept liability for any direct or indirect damage 
sustained as a result of software viruses and would advise that you conduct 
your own virus checks before opening any attachment. VFS Global Group does not 
guarantee the security of any information transmitted electronically and is not 
liable for the proper, timely and complete transmission thereof.
--








Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr

2020-02-04 Thread Charlie Hull

Hi all,

You have until this Friday to submit a talk to Haystack! Very much 
looking forward to your submissions.


Charlie

On 27/01/2020 21:53, Doug Turnbull wrote:

Just an update the CFP was extended to Feb 7th, less than 2 weeks away.  ->
http://haystackconf.com

It's your ethical imperative to share! ;)
https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/

And no talk is too small, people often underestimate what they're doing,
and very much underestimate how interesting others will find your story!
The best talks often come from the least expected people & orgs.

On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull  wrote:


Hi all,

Haystack, the search relevance conference, is confirmed for 29th & 30th
April 2020 in Charlottesville, Virginia - the CFP is open and we need
your contributions! More information at www.haystackconf.com
<http://www.haystackconf.com>including links to previous talks, deadline
is 31st January. We'd love to hear your Lucene/Solr relevance stories.

Cheers

Charlie
--

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk







Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr

2020-01-28 Thread Charlie Hull
We're expecting prices to be very similar to last year - early bird will 
be $300 ish for conference only and $2250 ish for conference plus a 
training (we're running no less than 5 different classes that week 
including Think Like a Relevance Engineer, Hello LTR and NLP) - 
hopefully this will give you enough information for budgeting.


Speakers get a small discount too!

Cheers

Charlie

On 27/01/2020 22:21, John Blythe wrote:

Hey Doug. Do you know the pricing yet? Trying to get something submitted to
VP so I can take my team to the conference. Thanks!

On Mon, Jan 27, 2020 at 14:54 Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:


Just an update the CFP was extended to Feb 7th, less than 2 weeks away.  ->
http://haystackconf.com

It's your ethical imperative to share! ;)

https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/

And no talk is too small, people often underestimate what they're doing,
and very much underestimate how interesting others will find your story!
The best talks often come from the least expected people & orgs.

On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull  wrote:


Hi all,

Haystack, the search relevance conference, is confirmed for 29th & 30th
April 2020 in Charlottesville, Virginia - the CFP is open and we need
your contributions! More information at www.haystackconf.com
<http://www.haystackconf.com>including links to previous talks, deadline
is 31st January. We'd love to hear your Lucene/Solr relevance stories.

Cheers

Charlie
--

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



--
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.






Re: Update synonyms.txt file based on values in the database

2020-01-16 Thread Charlie Hull
Try looking into Managed Resources: 
https://lucene.apache.org/solr/guide/8_4/managed-resources.html
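As a hedged sketch of how that could look: rows fetched from the database are turned into the JSON map that the managed synonyms REST endpoint accepts (a PUT to /solr/<collection>/schema/analysis/synonyms/<resource>, per the guide linked above). The row data below is invented, and in practice the rows would come from a PostgreSQL query rather than a literal list:

```python
import json

# Pretend these rows came from a PostgreSQL query over the synonyms table
rows = [("tv", "television"), ("tv", "telly")]

# Group synonyms by term into the map shape the managed resource expects
synonyms = {}
for term, synonym in rows:
    synonyms.setdefault(term, []).append(synonym)

payload = json.dumps(synonyms)
print(payload)  # {"tv": ["television", "telly"]}
```

Running this on a schedule avoids editing synonyms.txt by hand; after updating the managed resource the collection still needs a reload for the new synonyms to take effect.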


Charlie

On 15/01/2020 10:35, seeteshh wrote:

How do I update the synonyms.txt file if the data is being fetched from a
database, say PostgreSQL? I won't be able to update the synonyms.txt file
manually every time, and the data is related to a table and not known to
Solr.

I am using Apache Solr 8.4.

Regards,

Seetesh hindlekar



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html






Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-16 Thread Charlie Hull

On 15/01/2020 11:42, Dc Tech wrote:

Thank you Jan and Charlie.

I should say that in terms of posting to the community regarding Elastic vs 
Solr - this is probably the most civil and helpful community that I have been a 
part of - and your answers have only reinforced that  notion !!

Thank you for your responses. I am glad to hear that both can do most of it, 
which was my gut feeling as well.

Charlie, to your point - the team probably feels that Elastic  is easier to get 
started with hence the preference, as well as the hosting options (with the 
caveats you noted). Agree with you completely that tech is not the real issue.

Jan,  agree with  the points you made on team skills.  On our previous 
proprietary engine - that was in fact the biggest issue - the engine was 
powerful enough and had good references.  However, we were not able to exploit 
it to good effect.


Hi again,

The dirty secret that few will voice is that...most search engines are 
basically the same. Once you've worked on a search project you can apply 
those skills to any future search engine. This is why I'm currently 
focused on supporting the search team, not the search tech - how do you 
learn and improve those relevance tuning skills, considering it's 
really, really hard to recruit people with existing high-level search 
skills (and if you can find them you probably can't afford them).


Cheers

Charlie



Thank you again.


On Jan 15, 2020, at 5:10 AM, Jan Høydahl  wrote:

Hi,

Choosing the solr community mailing list to ask advice for whether to choose ES 
- you already know what to expect, not?
More often than not the choice comes down to policy, standardization, what 
skills you have in the house etc rather than ticking off feature checkboxes.
Sometimes company values also may drive a choice, i.e. Solr is 100% Apache and 
not open core, which may matter if you plan to get involved in the community, 
and contribute features or patches.

However, if I were in your shoes as architect to evaluate tech stack, and there 
was not a clear choice based on the above, I’d do what projects normally do, to 
ask yourself what you really need from the engine. Maybe you have some features 
in your requirement list that makes one a much better choice over the other. Or 
maybe after that exercise you are still wondering what to choose, in which case 
you just follow your gut feeling and make a choice :)

Jan


15. jan. 2020 kl. 10:07 skrev Charlie Hull :


On 15/01/2020 04:02, Dc Tech wrote:
I am a SOLR fan and had implemented it in our company over 10 years ago.
I moved away from that role and the new search team in the meanwhile
implemented a proprietary (and expensive) nosql-style search engine. That
project did not go well, and now I am back on the project and reviewing the
technology stack.

Some of the team think that ElasticSearch could be a good option,
especially since we can easily get hosted versions with AWS where we have
all the contractual stuff sorted out.

You can, but you should be aware that:
1. Amazon's hosted Elasticsearch isn't great, often lags behind the current 
version, doesn't allow plugins etc.
2.  Amazon and Elastic are currently engaged in legal battles over who is the 
most open sourcey,who allegedly copied code that was 'open' but commercially 
licensed, who would like to capture the hosted search market...not sure how 
this will pan out (Google for details)
3. You can also buy fully hosted Solr from several places.

While SOLR definitely seems more advanced (LTR, streaming expressions,
graph, and all the knobs and dials for relevancy tuning), Elastic may be
sufficient for our needs. It does not seem to have LTR out of the box but
the relevancy tuning knobs and dials seem to be similar to what SOLR has.

Yes, they're basically the same under the hood (unsurprising as they're both 
based on Lucene). If you need LTR there's an ES plugin for that (disclaimer, my 
new employer built and maintains it: 
https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track of the 
amount of times I've been asked 'Elasticsearch or Solr, which should I choose?' 
and my current thoughts are:

1. Don't switch from one to the other for the sake of it.  Switching search 
engines rarely addresses underlying issues (content quality, team skills, 
relevance tuning methodology)
2. Elasticsearch is easier to get started with, but at some point you'll need 
to learn how it all works
3. Solr is harder to get started with, but you'll know more about how it all 
works earlier
4. Both can be used for most search projects, most features are the same, both 
can scale.
5. Lots of Elasticsearch projects (and developers) are focused on logs, which 
is often not really a 'search' project.


The corpus size is not a challenge - we have about one million documents,
of which about 1/2 have full text, while the rest are simpler (i.e. company
directory etc.).
The query volumes are also quite low (max 5/second at peak).
We have

Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-15 Thread Charlie Hull

On 15/01/2020 04:02, Dc Tech wrote:

I am a SOLR fan and had implemented it in our company over 10 years ago.
I moved away from that role and the new search team in the meanwhile
implemented a proprietary (and expensive) nosql-style search engine. That
project did not go well, and now I am back on the project and reviewing the
technology stack.

Some of the team think that ElasticSearch could be a good option,
especially since we can easily get hosted versions with AWS where we have
all the contractual stuff sorted out.

You can, but you should be aware that:
1. Amazon's hosted Elasticsearch isn't great, often lags behind the 
current version, doesn't allow plugins etc.
2.  Amazon and Elastic are currently engaged in legal battles over who 
is the most open sourcey,who allegedly copied code that was 'open' but 
commercially licensed, who would like to capture the hosted search 
market...not sure how this will pan out (Google for details)

3. You can also buy fully hosted Solr from several places.

While SOLR definitely seems more advanced (LTR, streaming expressions,
graph, and all the knobs and dials for relevancy tuning), Elastic may be
sufficient for our needs. It does not seem to have LTR out of the box but
the relevancy tuning knobs and dials seem to be similar to what SOLR has.
Yes, they're basically the same under the hood (unsurprising as they're 
both based on Lucene). If you need LTR there's an ES plugin for that 
(disclaimer, my new employer built and maintains it: 
https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track 
of the amount of times I've been asked 'Elasticsearch or Solr, which 
should I choose?' and my current thoughts are:


1. Don't switch from one to the other for the sake of it.  Switching 
search engines rarely addresses underlying issues (content quality, team 
skills, relevance tuning methodology)
2. Elasticsearch is easier to get started with, but at some point you'll 
need to learn how it all works
3. Solr is harder to get started with, but you'll know more about how it 
all works earlier
4. Both can be used for most search projects, most features are the 
same, both can scale.
5. Lots of Elasticsearch projects (and developers) are focused on logs, 
which is often not really a 'search' project.




The corpus size is not a challenge - we have about one million documents,
of which about 1/2 have full text, while the rest are simpler (i.e. company
directory etc.).
The query volumes are also quite low (max 5/second at peak).
We have implemented the content ingestion and processing pipelines already
in python and SPARK, so most of the data will be pushed in using APIs.

I would really appreciate any guidance from the community !!

Sounds like a pretty small setup to be honest, but as ever the devil is 
in the details.


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search (now part of OpenSourceConnections)

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19.com



Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr

2020-01-09 Thread Charlie Hull

Hi all,

Haystack, the search relevance conference, is confirmed for 29th & 30th 
April 2020 in Charlottesville, Virginia - the CFP is open and we need 
your contributions! More information at www.haystackconf.com 
<http://www.haystackconf.com>including links to previous talks, deadline 
is 31st January. We'd love to hear your Lucene/Solr relevance stories.


Cheers

Charlie
--

Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: hi question about solr

2019-12-02 Thread Charlie Hull

Hi,

https://livebook.manning.com/book/solr-in-action/chapter-3 may help (I'd 
suggest reading the whole book as well).


Basically what you're looking for is the 'term position'. The 
TermVectorComponent in Solr will allow you to return this for each result.
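For example, assuming the sample /tvrh request handler is configured (as in the shipped example configs) and a hypothetical "books" collection with a "content" field, term positions can be requested like this:

```python
from urllib.parse import urlencode

# Ask the TermVectorComponent for the position of every occurrence of
# each term in the matched documents' content field. Collection and
# field names are assumptions for illustration.
params = {
    "q": "content:name",
    "fl": "id",
    "tv": "true",
    "tv.positions": "true",  # include the position of each occurrence
    "tv.fl": "content",
}
url = "http://localhost:8983/solr/books/tvrh?" + urlencode(params)
print(url)
```

The response then lists, per document and per term, all positions, so a search for "name" in "hello my name is jeff what is your name" would report both occurrences, not just the first.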


Cheers

Charlie

On 02/12/2019 11:24, eli chen wrote:

hi im kind of new to solr so please be patient

i'll try to explain what i need and what im trying to do.

we have a lot of book content and we want to index it and allow search
in the books.
when someone searches for a term
i need to get back the position of the matched word in the book
for example
if the book content is "hello my name is jeff" and someone search for "my".
i want to get back the position of my in the content field (which is 1 in
this case)
i tried to do that with payloads but no success. and another problem i
encountered is:
lets say the content field is "hello my name is jeff what is your name".
now if someone search for "name" i want to get back the index of all
occurrences not just the first one

is there any way to do that with solr without developing new plugins

thx






Re: Icelandic support in Solr

2019-11-27 Thread Charlie Hull

On 26/11/2019 16:35, Mikhail Ibraheem wrote:

Hi, does Solr support the Icelandic language out of the box? If not, can you 
please let me know how to add that with custom analyzers?
Thanks


The Snowball stemmer project which is used by Solr 
(https://snowballstem.org/algorithms/ - co-created by Martin Porter, 
author of the famous stemmer) doesn't support Icelandic unfortunately. I 
can't find any other stemmers that you could use in Solr.


Basis Technology offer various commercial software for language 
processing that can work with Solr and other engines, not sure if they 
support Icelandic.


So, not very positive I'm afraid: you could look into creating your own 
stemmer using Snowball, or some heuristic approaches, but you'd need a 
good grasp of the structure of the language.
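For illustration only, this is the kind of rule-based suffix stripping such a stemmer automates; the suffix list below is a crude placeholder, not real Icelandic morphology, and a proper Snowball stemmer would encode far more of the language's structure:

```python
# Naive heuristic stemmer sketch: strip the longest matching suffix,
# keeping a minimum stem length. The suffix list is an assumption for
# demonstration, not a vetted Icelandic rule set.
SUFFIXES = ["arnir", "inn", "ar", "ur"]

def naive_stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("hestarnir"))  # -> "hest"
```

Such heuristics are brittle (over- and under-stemming are easy to hit), which is why building a real stemmer with the Snowball toolkit and a good grammar reference is the sounder route.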



Best


Charlie





Re: Active directory integration in Solr

2019-11-19 Thread Charlie Hull
Not out of the box, there are a few authentication plugins bundled but 
not for AD 
https://lucene.apache.org/solr/guide/7_2/authentication-and-authorization-plugins.html 
- there's also some useful stuff in Apache ManifoldCF 
https://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/ 



Best

Charlie

On 18/11/2019 15:08, Kommu, Vinodh K. wrote:

Hi,

Does anyone know whether Solr has any out-of-the-box capability to integrate 
Active Directory (using LDAP) when security is enabled? Instead of creating 
users in the security.json file, we are planning to use users who already exist 
in Active Directory so they can use their individual credentials rather than 
defining them in Solr. Did anyone come across a similar requirement? If so, was 
there a working solution?


Thanks,
Vinodh

DTCC DISCLAIMER: This email and any files transmitted with it are confidential 
and intended solely for the use of the individual or entity to whom they are 
addressed. If you have received this email in error, please notify us 
immediately and delete the email and any attachments from your system. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email.






Re: solr UI collection dropdown sorting order

2019-10-21 Thread Charlie Hull
I think we looked at this at our recent Hackday in DC - check out the 
first part of this blog: 
https://opensourceconnections.com/blog/2019/09/23/what-happens-at-a-lucene-solr-hackday/ 
- hopefully a pointer towards getting this fixed.


Best

Charlie

On 20/10/2019 09:06, Sotiris Fragkiskos wrote:

Hi everyone!

is there any way the collections available on the left-hand side of the
solr UI can be sorted? I'm referring to the "collection selector" dropdown.
But the same applies to the Collections button.
The sorting seems kind of..random?

Thanks in advance!

Sotir






Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-16 Thread Charlie Hull
My colleagues Eric Pugh and Dan Worley covered OCR and Solr in a 
presentation at our recent London Lucene/Solr Meetup:

https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/264579498/
(direct link to slides if you can't find it in the comments 
https://www.slideshare.net/o19s/payloads-and-ocr-with-solr)


HTH

Charlie


On 14/10/2019 11:40, Retro wrote:

Hello, thanks for answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format).
Content of these messages then indexed in SOLR. But SOLR can not process
attachments within those MSG files, can not OCR them. This is what I need -
to OCR attachments and get their content indexed in SOLR.

Davis, Daniel (NIH/NLM) [C] wrote

Nuance and ABBYY provide OCR capabilities as well.
Looking at higher level solutions, both indexengines.com and Comvault can
do email remediation for legal issues.

AJ Weber wrote

There are alternative, paid, libraries to parse and extract attachments
from EML files as well
EML attachments will have a mimetype associated with their metadata.

Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html





--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html






Hackday in DC next Tuesday

2019-09-03 Thread Charlie Hull

Hi all,

If you're in town for Activate next week, we're running another free 
Lucene Hackday on Tuesday: 
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/263993681/ 
- do come along if you can! It's only a block and a half from the 
Activate venue.


Cheers

Charlie

--
Charlie Hull
OpenSource Connections

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Ranking

2019-07-29 Thread Charlie Hull
From: Erik Hatcher
Date: Sat, 27 Jul 2019 16:55:51 -0400
To: solr-user@lucene.apache.org
Subject: Re: Ranking

The details of the scoring can be seen by setting debugQuery=true

 Erik


On Jul 27, 2019, at 15:40, Steven White  wrote:

Hi everyone,

I have 2 files like so:

FA has the letter "i" only 2 times, and the file size is 54,246 bytes
FB has the letter "i" 362 times, and the file size is 9,953 bytes

When I search on the letter "i", FB is ranked lower, which confuses me,
because I was under the impression that the number of occurrences of the
term in a document and the document size are both ranking factors, so I was
expecting FB to rank higher. Did I get this right? If not, what's causing
FB to rank lower?

I'm on Solr 8.1

Thanks

Steven
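Erik's advice is the definitive route, but for intuition it may help to sketch the BM25 formula that recent Lucene versions use by default. This is a rough sketch, not Solr's actual scoring path: Lucene measures document length in tokens rather than bytes and stores length norms lossily, and the token counts below are hypothetical stand-ins for the two files.

```python
import math

def bm25_term_score(tf, dl, avgdl, n_docs, df, k1=1.2, b=0.75):
    """Classic BM25 score of one query term in one document:
    term frequency saturates via k1, and b penalises long documents."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# Hypothetical token counts standing in for the two files:
# FA has "i" twice in a long document, FB has it 362 times in a short one.
fa = bm25_term_score(tf=2, dl=9000, avgdl=5000, n_docs=2, df=2)
fb = bm25_term_score(tf=362, dl=1600, avgdl=5000, n_docs=2, df=2)
assert fb > fa  # under plain BM25, FB would indeed rank higher
```

If the real ranking disagrees with this expectation, the per-document breakdown from debugQuery shows which factor (idf, tf, or the stored length norm) is responsible.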



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Quepid, the relevance testing tool for Solr, released as open source

2019-07-26 Thread Charlie Hull

Hi all,

We've finally made Quepid, the relevance testing tool, open source. 
There's also a free hosted version at www.quepid.com . Looking forward 
to contributions driving the project forward! Quepid is a way to record 
human relevance judgements, and then to experiment with query tuning and 
see the results in real time.


More details at 
https://opensourceconnections.com/blog/2019/07/25/2019-07-22-quepid-is-now-open-source/


(also particularly pleased to see Luwak, the stored query engine we 
built at Flax become part of Lucene - it's a great day for open source!)


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: Indexig excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Charlie Hull
Simpler possibly, but not necessarily reliable. If you do everything 
inside Solr's DIH with Tika under the hood to extract data from Excel, a 
malformed Excel file could kill Tika and bring down your entire Solr 
cluster. Far better to do it outside of Solr as this blog describes: 
https://lucidworks.com/post/indexing-with-solrj/


If you want to see what Tika does to your Excel examples this is quite a 
neat way to experiment: https://okfnlabs.org/projects/tika-server/


Cheers

Charlie

On 26/07/2019 09:44, Vipul Bahuguna wrote:

Hi Charlie,

Thanks for your suggestion, but I will have thousands of these files
coming from different sources. It would become very tedious if I have to
first convert them to CSV and then process them line by line.

I was hoping there could be a simpler way to achieve this using DIH,
which I thought could be configured to read and ingest MS Excel (xlsx)
files.

I am not too sure what the configuration file would look like.

Any pointers are welcome. Thanks!

On Fri, 26 Jul, 2019, 1:56 PM Charlie Hull,  wrote:


Convert the Excel file to a CSV and then write a teeny script to go
through it line by line and submit to Solr over HTTP? Tika would
probably work but it's a lot of heavy lifting for what seems to me like
a simple problem.

Cheers

Charlie

On 26/07/2019 09:19, Vipul Bahuguna wrote:

Hi Guys - can anyone suggest how to achieve this?
I have understood how to insert JSON documents. So one alternative that
comes to my mind is that I can convert the rows in my Excel to JSON format,
with the header of my Excel file becoming the JSON keys (corresponding to
the fields I have defined in my managed-schema.xml). And then each cell in
the Excel file will become the value of this field.

However, I am sure there must be a better way of directly ingesting the
Excel file to achieve the same. I was trying to read about DIH and Apache
Tika, but I am not very sure of how the configuration works.

My sample excel file has 4 columns namely -
1. First Name
2. Last Name
3. Phone
4. Website Link

I want to index these fields into SOLR in a way that all these columns
become my Solr schema fields and later I can search based on these fields.

Any suggestions please.

thanks !


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: Indexig excel (xlsx) file into SOLR 8.1.1

2019-07-26 Thread Charlie Hull
Convert the Excel file to a CSV and then write a teeny script to go 
through it line by line and submit to Solr over HTTP? Tika would 
probably work but it's a lot of heavy lifting for what seems to me like 
a simple problem.
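For illustration, the "teeny script" might look like the sketch below. The collection name mycoll, the Solr URL and the CSV column names are hypothetical; the script builds one JSON document per CSV row and would POST the batch to Solr's /update handler.

```python
import csv
import io
import json
import urllib.request

def csv_to_docs(csv_text):
    """Build one Solr JSON document per CSV row,
    using the header row as the field names."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def post_to_solr(docs, url="http://localhost:8983/solr/mycoll/update?commit=true"):
    """POST a batch of documents to Solr's update handler (URL is hypothetical)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

docs = csv_to_docs("first_name,last_name,phone\nVipul,Bahuguna,12345\n")
# docs is now [{"first_name": "Vipul", "last_name": "Bahuguna", "phone": "12345"}]
```

Keeping the document-building separate from the HTTP call also makes the script easy to test without a running Solr.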


Cheers

Charlie

On 26/07/2019 09:19, Vipul Bahuguna wrote:

Hi Guys - can anyone suggest how to achieve this?
I have understood how to insert JSON documents. So one alternative that
comes to my mind is that I can convert the rows in my Excel to JSON format,
with the header of my Excel file becoming the JSON keys (corresponding to
the fields I have defined in my managed-schema.xml). And then each cell in
the Excel file will become the value of this field.

However, I am sure there must be a better way of directly ingesting the
Excel file to achieve the same. I was trying to read about DIH and Apache
Tika, but I am not very sure of how the configuration works.

My sample excel file has 4 columns namely -
1. First Name
2. Last Name
3. Phone
4. Website Link

I want to index these fields into SOLR in a way that all these columns
become my solr schema fields and later I can search based on these fields.

Any suggestions please.

thanks !



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: Understanding DebugQuery

2019-07-09 Thread Charlie Hull

Hi Paresh,

There are various tools available for breaking down the Debug query: 
www.splainer.io (disclaimer, I work for OSC who wrote this) and a few 
others - check out section 4 of this post for more 
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/


Cheers

Charlie

On 09/07/2019 06:43, Paresh Khandelwal wrote:

Hi All,

I tried to get the debug information for my INNER JOIN and ACROSS JOIN
queries and am trying to understand it.

See the query below - 1487 msec
 {
   "responseHeader":{
 "status":0, "QTime":1487,
 "params":{  "q":"*:*",
   "fq.op":"AND",   "indent":"on",
   "fl":"TC_0Y0_Item_ID",
   "fq":["TC_0Y0_Occurrence_Name:\"6935 style rear MY11+\"",
 "TC_0Y0_ProductScope:xtWNf_fTAaLUgD",
 "{!join to=TC_0Y0_Item_ID
from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id
fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773"],
   "wt":"json",   "debugQuery":"on",
   "group.field":"TC_0Y0_Item_ID", ..
   "debug":{
 "join":{
   "{!join from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id
to=TC_0Y0_Item_ID
fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773":{
 "time":955,
 "fromSetSize":3,
 "toSetSize":14560,
 "fromTermCount":6632106,
 "fromTermTotalDf":6632106,
 "fromTermDirectCount":6632106,
 "fromTermHits":1,
 "fromTermHitsTotalDf":1,
 "toTermHits":1,
 "toTermHitsTotalDf":14560,
 "toTermDirectCount":0,
 "smallSetsDeferred":1,
 "toSetDocsAdded":14560}},
 "rawquerystring":"*:*",
 "querystring":"*:*",
 "parsedquery":"MatchAllDocsQuery(*:*)",
 "parsedquery_toString":"*:*",
 "explain":{
   "AZD1uV0qgj6GxC":"\n1.0 = *:*, product of:\n  1.0 = boost\n
  1.0 = queryNorm\n"},
 "QParser":"LuceneQParser",
 "filter_queries":["TC_0Y0_Occurrence_Name:\"6935 style rear
MY11+\"",
   "TC_0Y0_ProductScope:xtWNf_fTAaLUgD",
   "{!join to=TC_0Y0_Item_ID
from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id
fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773"],
 "parsed_filter_queries":["TC_0Y0_Occurrence_Name:6935 style
rear MY11+",
   "TC_0Y0_ProductScope:xtWNf_fTAaLUgD",
   "JoinQuery({!join
from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id to=TC_0Y0_Item_ID
fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773)"],
 "timing":{
   "time":1487.0,  ..

I am trying to see why fromTermCount is so high when fromSetSize and
toSetSize is less?

Where can I find the details about all the contents of debugQuery and how
to read each component?

Any help is appreciated.

Regards,
Paresh
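When comparing several such responses, it can help to pull the join statistics out of the debug section programmatically. Below is a small sketch against the structure shown above; the key names are taken from Paresh's output, and the join query key is abbreviated here:

```python
# A trimmed-down debug section, with values copied from the response above.
response = {"debug": {"join": {
    "{!join ...}...:92138773": {
        "time": 955,
        "fromSetSize": 3,
        "toSetSize": 14560,
        "fromTermCount": 6632106,
        "fromTermHits": 1,
    },
}}}

# Flatten each join's statistics into rows for side-by-side comparison.
rows = [(q, s["time"], s["fromSetSize"], s["toSetSize"], s["fromTermCount"])
        for q, s in response["debug"]["join"].items()]
for q, time_ms, from_set, to_set, from_terms in rows:
    print(f"{time_ms} ms, fromSet={from_set}, toSet={to_set}, fromTerms={from_terms}")
```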



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-05 Thread Charlie Hull

On 05/07/2019 14:33, Joseph_Tucker wrote:

Thanks for your help / suggestion.

I'm not sure I completely follow in this case.
SolrJ looks like a way to let Java applications talk to Solr; any other
third-party client would simply be a communication method between Solr
and the language of your choosing.

I guess what I'm after is, how would using SolrJ improve performance when
indexing?


It's not just about improving performance (although DIH is single 
threaded, so you could obtain a marked indexing performance gain using a 
client such as SolrJ).  With DIH you will embed a lot of SQL code into 
Solr's configuration files, and the more sources you add the more 
complicated, hard to debug and unmaintainable it's going to be. You 
should thus consider writing a proper indexing script in Java, Python or 
whatever language you are most familiar with - this has always been our 
approach.
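As an illustration of that approach, here is a minimal multi-source indexing sketch. It is written under assumptions: SQLite stands in for whatever databases you have, and the Solr URL, the collection name mycoll and the SQL are hypothetical. The point is that each source keeps its own query in code rather than in DIH configuration.

```python
import json
import sqlite3
import urllib.request

def fetch_docs(db_path, query, source_name):
    """Pull rows from one database and tag each with its source."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    docs = [dict(row, source=source_name) for row in con.execute(query)]
    con.close()
    return docs

def index_docs(docs, url="http://localhost:8983/solr/mycoll/update?commit=true"):
    """POST a batch of documents to Solr (hypothetical URL)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# One fetch per source, merged into a single batch:
# docs = (fetch_docs("db_a.sqlite", "SELECT id, title FROM items", "db_a")
#         + fetch_docs("db_b.sqlite", "SELECT id, title FROM products", "db_b"))
# index_docs(docs)
```

Batching the POSTs and running fetches in parallel is where this approach can also overtake single-threaded DIH.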


Best


Charlie



*** I could be wrong in my assumptions as I'm still learning a great deal
about Solr. ***

I appreciate your help

Regards,

Joe



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: Solr upgrade question

2019-07-05 Thread Charlie Hull

On 05/07/2019 14:49, Margo Breäs | INDI wrote:


Hi all,

At the moment we are working with Solr version 4.8.1 in combination 
with an older version of Intershop.


We have recently migrated our entire shop to a new party, and so there 
is room for improvements.


Are there any known issues with upgrading over that many versions in 
general, or with an Intershop version specifically?


If so we would appreciate your experiences/stories, so we can mitigate 
things beforehand.


If you're going to migrate from that old a version of Solr, I think you 
will need to re-index completely and also check that all your queries 
work as you expect...there have been a lot of changes since then and 
don't underestimate the task!


Cheers


Charlie


Thanks in advance,

best regards,

Margo Breas | INDI


Met vriendelijke groet / Kind regards,

Margo Breäs
Categoriespecialist
T. +31 88 0666 000
E. margo.br...@indi.nl <mailto:margo.br...@indi.nl>
W. www.indi.nl
<https://www.indi.nl/nl-nl/?utm_medium=email&utm_source=email_handtekening&utm_campaign=margo_breas>

INDI.nl website
<https://www.indi.nl/nl-nl/?utm_medium=email&utm_source=email_handtekening&utm_campaign=margo_breas>





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: SOLR (6.3.0) Initialization Issue

2019-05-10 Thread Charlie Hull

On 10/05/2019 08:52, Charlie Hull wrote:

On 09/05/2019 19:18, SAMMAR UL HASSAN wrote:


Hi Support Team,

I hope all is well. Let me explain who we are, what we are currently
doing, and what we need from you.

We are an IT-based healthcare company, providing healthcare software
services (EHR/EMR) to doctors across the U.S. In many important
modules of our products we have implemented SOLR-based smart search.
We know the basics of SOLR and we are doing well achieving our
requirements, but we face issues from time to time and try to resolve
them to the best of our knowledge. At the moment, we are facing the
attached errors and need your support to resolve this issue
permanently. We would appreciate it if you could arrange a call to
discuss this issue. If you need any additional information, please
let us know.



Hi,

Solr is an open source product, so you have various options to get 
support. I'm assuming you've already done your own research around the 
issues you're facing.


1. ask on this mailing list, providing as much detail as you can, and 
hopefully someone will be able to help - but be aware that those who 
respond are volunteering their time from often very busy lives - and 
no-one is likely to want to arrange a call.
2. engage a professional services company (disclaimer: I work for 
OpenSource Connections who provide this sort of help, there are many 
others - see https://wiki.apache.org/solr/Support for individuals and 
companies who know Solr)

3. train up your own team on Solr, there are many courses available.

4. this list sometimes strips attachments, so I'm afraid the list of 
errors you supplied didn't arrive - perhaps put them inline?


C



HTH

Charlie


*Regards*

Syed Sammar ul Hassan

*Lead Surescripts-Development*

MTBC | A Unique Healthcare IT Company®

7 Clyde Road | Somerset, NJ 08873

P: 732-873-5133 x319 | F:  732-873-3378

www.mtbc.com <http://www.mtbc.com/>| sammarulhas...@mtbc.com 
<mailto:sammarulhas...@mtbc.com>


Follow MTBC on Twitter, LinkedIn and Facebook

ONC-ACB Certified EHR | Deloitte® Technology Fast 500 | SureScripts® 
Solution Provider | Microsoft® Gold Certified Partner | Inc. 500|5000®


NOTICE: The information contained in this e-mail message is 
confidential and intended only for the personal and confidential use 
of the designated recipient(s) named above. If the reader of this 
message is not the intended recipient or an agent responsible for 
delivering it to the intended recipient, you have received this 
document in error, and any review, distribution, or copying of this 
message is strictly prohibited.  If you have received this 
communication in error, please notify us immediately by email or 
telephone and delete the original message in its entirety.  MTBC, the 
stylized MTBC logo, A Unique Healthcare IT Company and other MTBC 
logos, product and service names are trademarks of MTBC.




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web:www.flax.co.uk



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: SOLR (6.3.0) Initialization Issue

2019-05-10 Thread Charlie Hull

On 09/05/2019 19:18, SAMMAR UL HASSAN wrote:


Hi Support Team,

I hope all is well. Let me explain who we are, what we are currently
doing, and what we need from you.

We are an IT-based healthcare company, providing healthcare software
services (EHR/EMR) to doctors across the U.S. In many important
modules of our products we have implemented SOLR-based smart search.
We know the basics of SOLR and we are doing well achieving our
requirements, but we face issues from time to time and try to resolve
them to the best of our knowledge. At the moment, we are facing the
attached errors and need your support to resolve this issue
permanently. We would appreciate it if you could arrange a call to
discuss this issue. If you need any additional information, please
let us know.



Hi,

Solr is an open source product, so you have various options to get 
support. I'm assuming you've already done your own research around the 
issues you're facing.


1. ask on this mailing list, providing as much detail as you can, and 
hopefully someone will be able to help - but be aware that those who 
respond are volunteering their time from often very busy lives - and 
no-one is likely to want to arrange a call.
2. engage a professional services company (disclaimer: I work for 
OpenSource Connections who provide this sort of help, there are many 
others - see https://wiki.apache.org/solr/Support for individuals and 
companies who know Solr)

3. train up your own team on Solr, there are many courses available.

HTH

Charlie


*Regards*

Syed Sammar ul Hassan

*Lead Surescripts-Development*

MTBC | A Unique Healthcare IT Company®

7 Clyde Road | Somerset, NJ 08873

P:  732-873-5133 x319 | F:  732-873-3378

www.mtbc.com <http://www.mtbc.com/>| sammarulhas...@mtbc.com 
<mailto:sammarulhas...@mtbc.com>


Follow MTBC on Twitter, LinkedIn and Facebook

ONC-ACB Certified EHR | Deloitte® Technology Fast 500 | SureScripts® 
Solution Provider | Microsoft® Gold Certified Partner | Inc. 500|5000®






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: solr search Ontology based data set

2019-03-14 Thread Charlie Hull

On 13/03/2019 17:01, Jie Luo wrote:

Hi all,

I have several ontology-based data sets, and I would like to use Solr as the
search engine. Solr documents are flat, so I would like to know the best way
to handle the search.

Simple search is fine. One possible search I will need is to retrieve the
ontology tree or graph.

Best regards

Jie


Are you aware of the BioSolr project? Have a chat to Sameer Velankar at 
EBI. There's some background here


https://github.com/flaxsearch/BioSolr
https://www.ebi.ac.uk/spot/BioSolr/

Various ontology indexing code for Solr was developed as part of this 
project.


Best

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Re: SOLR and AWS comprehend

2019-02-14 Thread Charlie Hull

On 13/02/2019 12:17, Gareth Baxendale wrote:

This is perhaps more of an architecture question than a dev code one, but
I'd appreciate collective thoughts!

We are using Solr to order records and to categorise them to allow users to
search and find specific medical conditions. We have an opportunity to make
use of Machine Learning to aid and improve the results. AWS Comprehend is
the product we are looking at but there is a question over whether one
should replace the other as they would compete or if in fact both should
work together to provide the solution we are after.


One is an open source search engine and one is a closed source hosted 
NLP service you pay for. I think you're comparing chalk and cheese here: 
you would use a NLP service to enhance the source data before indexing 
with something like Solr, or extract information from a query before 
searching. Although Solr does contain some classification features it 
doesn't contain any NLP features - although as my colleague Liz writes 
you can now easily integrate Solr & OpenNLP, another open source 
toolkit. 
https://opensourceconnections.com/blog/2018/08/06/intro_solr_nlp_integrations/


By the way are you aware that NHS Wales are using Solr to power their 
patient records service?


Best

Charlie


Appreciate any insights people have.

Thanks Gareth




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Haystack Relevance Conference Announced; CFP ends Jan 9!

2019-01-09 Thread Charlie Hull

Hi all,

Just to let you know the CFP has been extended until January 30th and 
we're really looking forward to seeing your proposals! 
http://haystackconf.com


Cheers

Charlie


On 27/11/2018 22:33, Doug Turnbull wrote:

Hey everyone,

Many of you may know about/have been to Haystack - The Search Relevance
Conference.
http://haystackconf.com

We're excited to announce 2019's Haystack, April 22-25 in Charlottesville,
VA, USA. Our CFP is due January 9th.

We want to bring together practitioners that work on really interesting
search relevance problems. We want talks that really get into the
nitty-gritty of improving relevance, getting into technically meaty talks
in applied Information Retrieval based on open source search.

We know the Solr community is chock full of great ideas and problems
solved, and we look forward to hearing about the tough problems you've
solved with Solr/Lucene/Elasticsearch/Vespa/A Team of Trained
Hamsters/whatever.

Best
-Doug




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: SV: Tool to format the solr query for easier reading?

2019-01-08 Thread Charlie Hull

On 08/01/2019 09:20, Hullegård, Jimi wrote:

Hi Charlie,

Care to elaborate on that a little? I can't seem to find any tool in that blog 
entry that formats a given solr query. What tool did you have in mind?


Hi Jimi,

I recalled that the Chrome plugin would do this. Obviously it's not a
perfect solution for you as you'd prefer a Java formatter, but it's a
start - have you tried this one?


Best

Charlie


/Jimi

-----Original Message-----
From: Charlie Hull
Sent: 8 January 2019 15:55
To: solr-user@lucene.apache.org
Subject: Re: Tool to format the solr query for easier reading?

On 08/01/2019 04:33, Hullegård, Jimi wrote:

Hi,


Hi Jimi,

There are some suggestions in part 4 of my recent blog:
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/

Cheers

Charlie


I often find myself having to analyze an already existing solr query. But when 
the number of clauses and/or number of nested parentheses reach a certain level 
I can no longer grasp what the query is about by just a quick glance. Sometimes 
I can look at the code generating the query, but it might be autogenerated in a 
complex way, or I might only have access to a log output of the query.

Here is an example query, based on a real query in our system:


system:(a) type:(x OR y OR z) date1:[* TO
2019-08-31T06:15:00Z/DAY+1DAYS] ((boolean1:false OR date2:[* TO
2019-08-31T06:15:00Z/DAY-30DAYS]))
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR
date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]))


Here I find it quite difficult to see what clauses are grouped together (using 
parentheses). What I tend to do in these circumstances is to copy the query 
into a simple text editor, and then manually add line breaks and indentation 
matching the parentheses levels.

For the query above, it would result in something like this:


system:(a)
type:(x OR y OR z)
date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]
(
  (boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])
)
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *]
(
  ((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS])
)


But that is a slow process, and I might make a mistake that messes up the 
interpretation completely. Especially when there are several levels of nested 
parentheses.

Does anyone know of any kind of tool that would help automate this? It wouldn't 
have to format its output like my example, as long as it makes it easier to see 
what start and end parentheses belong to each other, preferably using multiple 
lines and indentation.

A java tool would be perfect, because then I could easily integrate it into our 
existing debugging tools, but an online formatter (like 
http://jsonformatter.curiousconcept.com) would also be very useful.

Regards
/Jimi

Svenskt Näringsliv processes your personal data in accordance with the GDPR.
You can read more about our processing and your rights here:
Privacy policy <https://www.svensktnaringsliv.se/dataskydd/integritet-och-behandling-av-personuppgifter_697219.html?utm_source=sn-email&utm_medium=email>




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: SV: Tool to format the solr query for easier reading?

2019-01-08 Thread Charlie Hull

On 08/01/2019 09:20, Hullegård, Jimi wrote:

Hi Charlie,

Care to elaborate on that a little? I can't seem to find any tool in that blog 
entry that formats a given solr query. What tool did you have in mind?


This also does some basic URL splitting: 
https://www.freeformatter.com/url-parser-query-string-splitter.html


Cheers

Charlie


/Jimi

-----Original Message-----
From: Charlie Hull
Sent: 8 January 2019 15:55
To: solr-user@lucene.apache.org
Subject: Re: Tool to format the solr query for easier reading?

On 08/01/2019 04:33, Hullegård, Jimi wrote:

Hi,


Hi Jimi,

There are some suggestions in part 4 of my recent blog:
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/

Cheers

Charlie


I often find myself having to analyze an already existing solr query. But when 
the number of clauses and/or number of nested parentheses reach a certain level 
I can no longer grasp what the query is about by just a quick glance. Sometimes 
I can look at the code generating the query, but it might be autogenerated in a 
complex way, or I might only have access to a log output of the query.

Here is an example query, based on a real query in our system:


system:(a) type:(x OR y OR z) date1:[* TO
2019-08-31T06:15:00Z/DAY+1DAYS] ((boolean1:false OR date2:[* TO
2019-08-31T06:15:00Z/DAY-30DAYS]))
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR
date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]))


Here I find it quite difficult to see what clauses are grouped together (using 
parentheses). What I tend to do in these circumstances is to copy the query 
into a simple text editor, and then manually add line breaks and indentation 
matching the parentheses levels.

For the query above, it would result in something like this:


system:(a)
type:(x OR y OR z)
date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]
(
  (boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])
)
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *]
(
  ((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS])
)


But that is a slow process, and I might make a mistake that messes up the 
interpretation completely. Especially when there are several levels of nested 
parentheses.

Does anyone know of any kind of tool that would help automate this? It wouldn't 
have to format its output like my example, as long as it makes it easier to see 
what start and end parentheses belong to each other, preferably using multiple 
lines and indentation.

A java tool would be perfect, because then I could easily integrate it into our 
existing debugging tools, but an online formatter (like 
http://jsonformatter.curiousconcept.com) would also be very useful.

Regards
/Jimi





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Tool to format the solr query for easier reading?

2019-01-08 Thread Charlie Hull

On 08/01/2019 04:33, Hullegård, Jimi wrote:

Hi,


Hi Jimi,

There are some suggestions in part 4 of my recent blog: 
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/


Cheers

Charlie


I often find myself having to analyze an already existing solr query. But when 
the number of clauses and/or number of nested parentheses reach a certain level 
I can no longer grasp what the query is about by just a quick glance. Sometimes 
I can look at the code generating the query, but it might be autogenerated in a 
complex way, or I might only have access to a log output of the query.

Here is an example query, based on a real query in our system:


system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] 
((boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])) 
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR date5:* OR 
date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]))


Here I find it quite difficult to what clauses are grouped together (using 
parentheses). What I tend to do in these circumstances is to copy the query 
into a simple text editor, and then manually add line breaks and indentation 
matching the parentheses levels.

For the query above, it would result in something like this:


system:(a)
type:(x OR y OR z)
date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]
(
  (boolean1:false OR date2:[* TO 
2019-08-31T06:15:00Z/DAY-30DAYS])
)
-date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *]
(
  ((*:* -date4:*) OR date5:* OR date3:[* TO 
2019-08-31T06:15:00Z/DAY+1DAYS])
)


But that is a slow process, and I might make a mistake that messes up the 
interpretation completely. Especially when there are several levels of nested 
parentheses.

Does anyone know of any kind of tool that would help automate this? It wouldn't 
have to format its output like my example, as long as it makes it easier to see 
what start and end parentheses belong to each other, preferably using multiple 
lines and indentation.

A java tool would be perfect, because then I could easily integrate it into our 
existing debugging tools, but an online formatter (like 
http://jsonformatter.curiousconcept.com) would also be very useful.

Regards
/Jimi





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
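[Editor's note: as a rough illustration of what such a formatter might do — a hypothetical sketch, not an existing tool — indenting a query by parenthesis depth takes only a few lines. It deliberately ignores quoted phrases and range brackets, so it is only a starting point:]

```python
def format_query(query: str, indent: str = "  ") -> str:
    """Indent a Lucene/Solr query by parenthesis nesting depth.

    Naive sketch: does not special-case quoted phrases or range
    brackets, so parentheses inside "..." would confuse it.
    """
    out, depth, buf = [], 0, ""

    def flush():
        # Emit any accumulated text at the current indent level.
        if buf.strip():
            out.append(indent * depth + buf.strip())

    for ch in query:
        if ch == "(":
            flush()
            out.append(indent * depth + "(")
            depth += 1
            buf = ""
        elif ch == ")":
            flush()
            depth = max(depth - 1, 0)
            out.append(indent * depth + ")")
            buf = ""
        else:
            buf += ch
    flush()
    return "\n".join(out)

print(format_query("system:(a) type:(x OR y OR z) ((boolean1:false OR date2:x))"))
```

Integrating something like this into existing Java debugging tools would just be a port of the same character loop.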


Re: Debugging Solr Search results & Issues with Distributed IDF

2019-01-02 Thread Charlie Hull

On 01/01/2019 23:03, Lavanya Thirumalaisami wrote:


Hi,

I am trying to debug a query to find out why one document gets a higher score than
the other. Below are two similar products.


You might take a look at OSC's Splainer http://splainer.io/ or some of 
the other tools I've written about recently at 
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ 
- note that this also covers some commercial offerings (and also that 
I'm very happy to take any comments or additions!).


Cheers

Charlie


Below is the debug results I get from Solr admin console.

  "Doc1": "\n15.20965 = sum of:\n 4.7573533 = max of:\n    4.7573533= weight(All:2x in 962) 
[], result of:\n   4.7573533 = score(doc=962,freq=2.0 =termFreq=2.0\n), product of:\n   3.4598935 
= idf(docFreq=1346, docCount=42836)\n    1.375 = tfNorm, computed from:\n  2.0 = termFreq=2.0\n   
   1.2 = parameter k1\n  0.0 = parameter b (norms omitted forfield)\n  10.452296 = max of:\n    
5.9166136 = weight(All:powerpoint in 962)[], result of:\n  5.9166136 =score(doc=962,freq=2.0 = 
termFreq=2.0\n), product of:\n    4.302992 = idf(docFreq=579,docCount=42836)\n    1.375 = 
tfNorm,computed from:\n  2.0 =termFreq=2.0\n  1.2 = parameterk1\n  0.0 = parameter b 
(normsomitted for field)\n    10.452296 =weight(All:\"socket outlet\" in 962) [], result of:\n  
10.452296 = score(doc=962,freq=2.0 =phraseFreq=2.0\n), product of:\n   7.60167 = idf(), sum of:\n 
3.5370626 = idf(docFreq=1246, docCount=42836)\n  4.064607 = idf(docFreq=735,docCount=42836)\n    
1.375 = tfNorm,computed from:\n  2.0 =phraseFreq=2.0\n  1.2 = parameterk1\n  0.0 = 
parameter b (normsomitted for field)\n",

"Doc15":"\n13.258003 = sum of:\n  5.7317085 = max of:\n    5.7317085 = weight(All:doubl in 
2122) [],result of:\n  5.7317085 =score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n    
4.168515 = idf(docFreq=663,docCount=42874)\n    1.375 = tfNorm,computed from:\n  2.0 
=termFreq=2.0\n  1.2 = parameterk1\n  0.0 = parameter b (normsomitted for field)\n    
4.7657394 =weight(All:2x in 2122) [], result of:\n 4.7657394 = score(doc=2122,freq=2.0 = termFreq=2.0\n), 
productof:\n    3.4659925 =idf(docFreq=1339, docCount=42874)\n   1.375 = tfNorm, computed from:\n 
2.0 = termFreq=2.0\n  1.2= parameter k1\n  0.0 = parameterb (norms omitted for field)\n   
 5.390302= weight(All:2g in 2122) [], result of:\n 5.390302 = score(doc=2122,freq=2.0 = termFreq=2.0\n), 
product of:\n    3.9202197 = idf(docFreq=850,docCount=42874)\n    1.375 = tfNorm,computed from:\n 
 2.0 = termFreq=2.0\n  1.2 = parameter k1\n  0.0 = parameter b (norms omitted forfield)\n 
 7.526294 = max of:\n    5.8597584 = weight(All:powerpoint in 2122)[], result of:\n  5.8597584 
=score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n    4.2616425 = 
idf(docFreq=604,docCount=42874)\n    1.375 = tfNorm,computed from:\n  2.0 = termFreq=2.0\n
  1.2 = parameter k1\n  0.0 = parameter b (norms omitted forfield)\n    7.526294 
=weight(All:\"socket outlet\" in 2122) [], result of:\n  7.526294 = score(doc=2122,freq=1.0 
=phraseFreq=1.0\n), product of:\n   7.526294 = idf(), sum of:\n 3.4955401 = idf(docFreq=1300, 
docCount=42874)\n  4.030754 = idf(docFreq=761,docCount=42874)\n    1.0 = tfNorm,computed from:\n  
    1.0 =phraseFreq=1.0\n  1.2 = parameterk1\n  0.0 = parameter b (normsomitted for 
field)\n",

  


My Questions

1.  IDF: I understand from the Solr documentation that IDF is calculated for each
separate shard. I have added the following stats cache config to solrconfig.xml
and reloaded the collection:



But even after that there is no change in the calculated IDF.

2.  What are parameter b and parameter K1?

3.  Why are there so many more parameters included in my Doc15 than in Doc1?

Is there any documentation I can refer to, to understand the Solr query
calculations in depth?

We are using Solr 6.1 in Cloud with 3 zookeepers and 3 masters and 3 replicas.

Regards,
Lavanya




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
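[Editor's note on question 2 above: k1 and b are the two free parameters of the BM25 similarity that recent Lucene/Solr versions use by default. The numbers in the quoted debug output can be reproduced by hand; a small sketch, using the BM25 formulas as I understand Lucene applies them:]

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    # Inverse document frequency, as printed in the explain output,
    # e.g. "3.4598935 = idf(docFreq=1346, docCount=42836)".
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq: float, k1: float = 1.2, b: float = 0.75,
                 field_len: float = 1.0, avg_field_len: float = 1.0) -> float:
    # Term-frequency saturation: k1 caps the benefit of repeated terms,
    # b controls field-length normalisation (b = 0 when norms are omitted).
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_len / avg_field_len))

# "1.375 = tfNorm" in the debug output: termFreq=2.0, k1=1.2, b=0.0
print(bm25_tf_norm(2.0, b=0.0))      # 1.375
print(bm25_idf(1346, 42836))         # ~3.4599, matching the explain above
```

So Doc15 simply matches more distinct query terms (hence more parameter blocks in its explanation), while Doc1 wins on the phrase match "socket outlet" occurring twice.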


Re: Questions about the IndexUpgrader tool.

2018-12-19 Thread Charlie Hull

On 18/12/2018 17:40, Erick Erickson wrote:

You are far better off re-indexing totally.


I would add '...if you have the original data'. Not everyone *can* 
re-index, and there are various hairy ways of updating an index in 
place, but they require deep-level magic.


But if you have the original source data, you should re-index.

Cheers

Charlie


Using IndexUpgraderTool has never guaranteed compatibility
across multiple major releases. I.e. if you have an index built
with 4x, using that tool will work for 5x, but then going from 5x
to 6x _even after the entire index is rewritten from 4 x format_
has  never been guaranteed to work. By "guaranteed to work"
here, I mean that there can be subtle problems, regardless
of appearances

The two most succinct statements as to why this is true follow.
I will not second guess _anything_ these two people have to
say about how Lucene works ;)

  From Mike McCandless:
“This really is the difference between an index and a database:
we do not store, precisely, the original documents.  We store an
efficient derived/computed index from them.”

  From Robert Muir:
“I think the key issue here is Lucene is an index not a database.
Because it is a lossy index and does not retain all of the user's
data, its not possible to safely migrate some things automagically...
The function is y = f(x) and if x is not available its not possible, so
lucene can't do it.”

As of 6x, a marker is written into each segments and the lowest
version is retained when segments are merged. 8x will refuse
to start if it detects a 6x marker so this will be enforced soon.

Best,
Erick

On Mon, Dec 17, 2018 at 12:27 PM Pushkar Raste  wrote:


Hi,
I have questions about the IndexUpgrader tool.

- I want to upgrade from Solr 4 to Solr 7. Can I run upgrade the index from
4 to 5 then 5 to 6 and finally 6 to 7 using appropriate version of the
IndexUpgrader but without loading the Index in the Solr at all during the
successive upgrades.

- The note in the tool says "This tool only keeps last commit in an index".
Does this mean I have optimize the index before running the tool?

- There is another note about partially upgraded index. How can the index
be partially upgraded. One scenario I can think of is 'If I upgraded let's
say from Solr 5 to Solr 6 and then added some documents. The new documents
will be in Lucene 6 format already, while old documents will still be Solr
5 format.' Is my understanding correct?



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr Cloud - Store Data using multiple drives -2

2018-11-22 Thread Charlie Hull

On 22/11/2018 11:50, Tech Support wrote:

Dear Solr Team,


I am using Solr 7.5.0 on Windows (SolrCloud). My primary need is: if the
current data storage drive is full, I need to use another drive without
moving the existing data to the new location.
  


If I add the new dataDir location in the core.properties file, only new data is
available in Solr. Only if we move the existing data into the new location
can I access the old indexed data.


Without moving the existing data, is it possible to use multiple data
directories in Solr?


You've already had some good and useful answers in a previous thread, so 
I'm not sure why you're asking the question again...but here goes:


You are asking whether it is possible to split a Solr /core/ across two 
data drives. I don't think that is possible as you've since found out, 
as there can only ever be one data directory set for a core.


However it is possible to create a Solr /collection/ that consists of 
multiple cores. You /shard/ the collection into several parts, each of 
one resides in a different core. You can then easily search over all 
these parts by addressing the collection in your search request. Each 
core could use a different data drive. This usually assumes you know how 
big your index will be and how many parts it needs splitting into, 
although there are ways to re-shard after the fact using the SolrCloud 
Collections API.


If you just want to keep adding disks as your data grows, you could also 
use an /alias/ across several /collections/, with each collection having 
one or more /cores/ on different data drives. Again this alias feature 
is available via the  SolrCloud Collections API.


(I think I've got that all right - this stuff can be confusing and the 
difference between cores, shards, collections etc. not always clear. 
This page is very helpful to understand the basic concepts 
https://lucene.apache.org/solr/guide/7_3/how-solrcloud-works.html#how-solrcloud-works)


I'd recommend reading up about Solr Cloud and thinking more about how to 
plan how to distribute your index before you start.


Another thing to think about is how you know that a disk is getting full 
- you can use Solr's metrics for this and we've also written a proxy 
that will block further updates if a disk is getting full - see 
http://www.flax.co.uk/blog/2016/04/21/running-disk-space-elasticsearch-solr/


HTH,

Charlie



  


Thanks,

Karthick Ramu





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
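[Editor's note: a sharded collection like the one described above is created through the SolrCloud Collections API CREATE action. A sketch of building that request — the host and collection name are invented for illustration:]

```python
from urllib.parse import urlencode

def create_collection_url(base: str, name: str, num_shards: int,
                          replication_factor: int = 1) -> str:
    """Build a Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base}/admin/collections?{urlencode(params)}"

# Hypothetical example: a 2-shard collection; each shard's core can then
# live in a different dataDir, i.e. on a different drive.
print(create_collection_url("http://localhost:8983/solr", "mydata", 2))
```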


Re: Solr cloud change collection index directory

2018-11-14 Thread Charlie Hull

On 13/11/2018 22:34, Shawn Heisey wrote:

If it's important for you to have the data separated from the program, 
setting the solr home is in my opinion the right way to go.  This 
separation is achieved by the service installer script that Solr 
includes, which runs on most operating systems other than Windows.  A 
service installer for Windows is something that's been on my mind to try 
and pursue, but there's never enough time.


The standard (but not only) way to install Solr as a Windows service is 
using NSSM and there are multiple guides available online. One *could* 
take these and write a detailed addendum to the Solr Ref Guide "Taking 
Solr to Production" page but it might be hard to cover the various ways 
to do this (batch files, Powershell scripts, runnable installers, Win32 
vs Win64) and produce a definitive best practice guide.


However, perhaps a short paragraph suggesting where else to look might 
be useful.


Cheers

Charlie


Thanks,
Shawn




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Slow import from MsSQL and down cluster during process

2018-10-23 Thread Charlie Hull

On 23/10/2018 02:57, Daniel Carrasco wrote:

Hello,

I've a Solr Cluster that is created with 7 machines on AWS instances. The
Solr version is 7.2.1 (b2b6438b37073bee1fca40374e85bf91aa457c0b) and all
nodes are running in NRT mode and I've a replica per node (7 replicas). One
node is used to import, and the rest are just for serve data.

My problem is that I'm having problems from about two weeks with a MsSQL
import on my Solr Cluster: when the process becomes slow or takes too long,
the entire cluster goes down.


How exactly are you importing from MsSQL to Solr? Are you using the Data 
Import Handler (DIH) and if so, how?  What evidence do you have that 
this is slow or takes too long?


Charlie


I'm confused, because the main reason to have a cluster is HA, and every
time the import node "fails" (is not really failing, just taking more time
to finish), the entire cluster fails and I've to stop the webpage until
nodes are green again.

I don't know if maybe I've to change something in configuration to allow
the cluster to keep working even when the import freezes or the import node
dies, but it is very annoying to wake up at 3 AM to fix the cluster.

Is there any way to avoid this? Maybe keeping the import node as NRT and
convert the rest to TLOG?

I'm a bit of a noob in Solr, so I don't know if I have to send something to
help find the problem. The cluster was created by just creating a Zookeeper
cluster, connecting the Solr nodes to that ZK cluster, importing the
collections and adding replicas manually to every collection.
Also I've upgraded that cluster from Solr 6 to Solr 7.1 and later to Solr
7.2.1.

Thanks and greetings!




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Status of the Zeppelin Solr Interpreter

2018-10-17 Thread Charlie Hull
Also, this was just mentioned in a talk here at Activate:
http://www.streamsolr.tk - the presenter Amrit Sarkar was certainly using
Zeppelin in his talk which would imply Lucidworks are still maintaining the
connectors.

Charlie

On Wed, 17 Oct 2018 at 16:37, Charlie Hull  wrote:

> Eric Pugh of Open Source Connections has used Lucidworks' Spark connector
> to allow SQL queries to be sent to Solr, is that another way you could use?
>
> Cheers
>
> Charlie
>
> On Wed, 17 Oct 2018 at 08:14, Jan Høydahl  wrote:
>
>> Hi
>>
>> What is the status of this project?
>> Looks pretty dead on GitHub: https://github.com/lucidworks/zeppelin-solr
>> Would love to be able to use this in a project.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>>


Re: Status of the Zeppelin Solr Interpreter

2018-10-17 Thread Charlie Hull
Eric Pugh of Open Source Connections has used Lucidworks' Spark connector
to allow SQL queries to be sent to Solr, is that another way you could use?

Cheers

Charlie

On Wed, 17 Oct 2018 at 08:14, Jan Høydahl  wrote:

> Hi
>
> What is the status of this project?
> Looks pretty dead on GitHub: https://github.com/lucidworks/zeppelin-solr
> Would love to be able to use this in a project.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
>


Re: Zookeeper external vs internal

2018-10-15 Thread Charlie Hull
It's also important to remember that you don't need a particularly large or
powerful node to run Zookeeper.

Charlie

On Sun, 14 Oct 2018 at 23:57, Shawn Heisey  wrote:

> On 10/14/2018 9:31 PM, Sourav Moitra wrote:
> > My question does running separate zookeeper ensemble in the same boxes
> > provides any advantage over using the solr embedded zookeeper ?
>
> The major disadvantage to having ZK embedded in Solr is this:  If you
> stop or restart the Solr process, part of your ZK ensemble goes down
> too.  It is vastly preferable to have it running as a separate process,
> so that you can restart one of the services without causing disruption
> in the other service.
>
> Thanks,
> Shawn
>
>


Re: Modify the log directory for dih

2018-10-05 Thread Charlie Hull

On 04/10/2018 16:35, Shawn Heisey wrote:

On 10/4/2018 12:30 AM, lala wrote:

Hi,
I am using:

Solr: 7.4
OS: windows7
I start solr using a service on startup.


In that case, I really have no idea where anything is on your system.

There is no service installation from the Solr project for Windows -- 
either you obtained that from somewhere else, or it's something written 
in-house.  Either way, you would need to talk to whoever created that 
service installation for help locating files on your setup.


We usually use NSSM for service-ifying Solr on Windows, I'd recommend 
you consider that. Also, bear in mind that a Windows Service can't 
output to stdout or stderr so some messages simply won't go anywhere - 
but the NSSM documentation is helpful.


Charlie


In general, you need to find the log4j2.xml file that is controlling 
your logging configuration and modify it.  It contains a sample of how 
to log something to a separate file -- the slow query log.  That example 
redirects a specific logger name (which is similar to a full qualified 
class name and in most cases *is* the class name) to a different logfile.


Version 7.4 has a bug when running on Windows that causes a lot of 
problems specific to logging.


https://issues.apache.org/jira/browse/SOLR-12538

That problem has been fixed in the 7.5 release.  You can also fix it by 
editing the solr.cmd script manually.


Additional info: I am developing a web application that uses solr as 
search

engine, I use DIH to index folders in solr using the
FileListEntityProcessor. What I need is logging each index operation in a
file that I can reach & read to be able to detect failed index files 
in the

folder.


The FileListEntityProcessor class has absolutely no logging in it.  If 
you require that immediately, you would need to add logging commands to 
the source code and recompile Solr yourself to produce a package with 
your change.  With an enhancement issue in Jira, we can review what 
logging is suitable for the class, and probably make it work like 
SQLEntityProcessor in that regard.  If that's done the way I think it 
should be, then you could add config in log4j2.xml to could enable DEBUG 
level logging for that class specifically and write its logs to a 
separate logfile.


Thanks,
Shawn




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Unnecessary Components

2018-09-20 Thread Charlie Hull
An interesting problem, perhaps we'll look at this at one of the 
Hackdays we're running soon! Previously we managed to cut down the Solr 
config files to fewer lines than the Apache license statement.


Charlie

On 19/09/2018 21:25, Shawn Heisey wrote:

On 9/19/2018 1:48 PM, oddtyme wrote:

I am helping implement solr for a "downloadable library" of sorts. The
objective is that communities without internet access will be able to 
access
a library's worth of information on a small, portable device. As such, 
I am

working within strict space constraints. What are some non-essential
components of solr that can be cut to conserve space for more 
information?


For basic functionality, the entire contrib directory could probably be 
removed.  That's more than half of the download right there.


Some of the jars in solr-webapp/webapp/WEB-INF/lib can likely be 
removed.  Chances are that you won't need the jars starting with 
"hadoop" - those are for HDFS support.  That's another 11 MB.  If you 
don't need either HDFS or SolrCloud, you can remove the zookeeper jar, 
and I think you can also remove the curator jars.  If you're not 
accessing Solr with a JDBC driver, you won't need the calcite jars. If 
you're not dealing with oriental characters (and sometimes even if you 
ARE), you can probably do without lucene-analyzers-kuromoji.


With careful code analysis, you can probably find other jars that aren't 
needed, but there's not a huge amount of space saving to be gained with 
most of the others.


Thanks,
Shawn




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Haystack, the search relevance conference comes to London on October 2nd 2018

2018-09-17 Thread Charlie Hull

On 21/08/2018 15:14, Charlie Hull wrote:

Hi all,

We're very happy to announce the first Haystack Europe conference in 
London on October 2nd.


Hi all,

Just to note the full conference programme is now up, including talks on 
Learning to Rank, tools for visualising and tuning relevance, building 
search relevance teams and more. Hope to see some of you there!

https://opensourceconnections.com/events/haystack-europe-2018/

Cheers

Charlie


https://opensourceconnections.com/events/haystack-europe-2018/

Come and hear talks by Doug Turnbull, co-author of Relevant Search, 
Karen Renshaw, Head of Search and Content for Grainger Global Online and 
other relevance experts, plus the usual networking and knowledge sharing.


Hope to meet some of you there!

Cheers

Charlie




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: MLT in Cloud Mode - Not Returning Fields?

2018-09-03 Thread Charlie Hull

On 31/08/2018 19:36, Doug Turnbull wrote:

Hello,

We're working on a Solr More Like This project (Solr 6.6.2), using the More
Like This searchComponent. What we note is in standalone Solr, when we
request MLT using the search component, we get every more like this
document fully formed with complete fields in the moreLikeThis section.


Hey Doug,

IIRC there wasn't a lot of support for MLT in cloud mode a few years 
ago, and there are certainly still a few open issues around cloud support:

https://issues.apache.org/jira/browse/SOLR-4414
https://issues.apache.org/jira/browse/SOLR-5480
Maybe there are some hints in the ticket comments about different ways 
to do what you want.


Cheers

Charlie



In cloud, however, with the exact same query and config, we only get the
doc ids under "moreLikeThis" requiring us to fetch the metadata associated
with each document.

I can't easily share an example due to confidentiality, but I want to check
if we're missing something? Documentation doesn't mention any limitations.
The only interesting note I've found is this one which points to a
potential difference in behavior


  The Cloud MLT Query Parser uses the realtime get handler to retrieve the

fields to be mined for keywords. Because of the way the realtime get
handler is implemented, it does not return data for fields populated using
copyField.

https://stackoverflow.com/a/46307140/8123

Any thoughts?

-Doug




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
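[Editor's note: as I understand it, cloud-mode MLT is usually driven through the {!mlt} query parser rather than the search component, invoked with the source document's id. A hypothetical sketch of building such a request — the field names and document id here are invented; mintf/mindf/qf are the parser's documented local params:]

```python
from urllib.parse import urlencode

def cloud_mlt_query(doc_id: str, qf: str, mintf: int = 2, mindf: int = 5) -> str:
    """Build the q parameter for SolrCloud's {!mlt} query parser."""
    return f"{{!mlt qf={qf} mintf={mintf} mindf={mindf}}}{doc_id}"

# Ask for similar documents, requesting full fields back via fl.
params = {"q": cloud_mlt_query("SP2514N", "title,description"),
          "fl": "id,title,description"}
print(urlencode(params))
```

Note the caveat quoted above still applies: the cloud parser fetches the source document via the realtime get handler, so copyField targets are not usable as qf fields.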


Re: Want to start contributing.

2018-08-23 Thread Charlie Hull

On 20/08/2018 18:45, Rohan Chhabra wrote:

Hi all,

I am an absolute beginner (dummy) in the field of contributing to open source,
but I am interested. How do I start? Solr is a Java-based search engine built
on Lucene. I am good at Java and therefore chose this to start.

I need guidance. Help required!!



A related topic: we are running two free Lucene Hackdays, in London on 
October 9th and Montreal on October 15th (the week of the Activate 
conference):

https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/253610289/

This would be a great place to meet and learn from existing Lucene 
committers.


Best

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Haystack, the search relevance conference comes to London on October 2nd 2018

2018-08-21 Thread Charlie Hull

Hi all,

We're very happy to announce the first Haystack Europe conference in 
London on October 2nd.


https://opensourceconnections.com/events/haystack-europe-2018/

Come and hear talks by Doug Turnbull, co-author of Relevant Search, 
Karen Renshaw, Head of Search and Content for Grainger Global Online and 
other relevance experts, plus the usual networking and knowledge sharing.


Hope to meet some of you there!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Hackdays in October, London & Montreal

2018-08-08 Thread Charlie Hull

On 13/07/2018 15:10, Charlie Hull wrote:

On 12/07/2018 10:28, Charlie Hull wrote:

Hi all,

A couple of years ago I ran two free Lucene Hackdays in London and 
Boston (the latter just before Lucene Revolution). Here's what we got 
up to with the kind support of Alfresco, Bloomberg, BA Insight and 
Lucidworks 
http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/ 



I'd like to do this again during the weeks of 8th and 15th October in 
London and Montreal (so just before the Activate event). It's a great 
chance to get together IRL with other Lucene/Solr/Elasticsearch 
hackers! I have a venue for London but a sponsor for evening 
curry/drinks would be wonderful, and for Montreal I still need a venue 
and evening sponsor - do let me know if you or your employer can help.


We have a placeholder event for London!
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/ 


...and we now have a venue for our Montreal event which will be on 
Monday 15th October 
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/253610289/


Hope to see some of you there!

Cheers

Charlie



C


I'll post again once there are more details and with a call for ideas 
as to what we should work on.


Best

Charlie






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Hackdays in October, London & Montreal

2018-07-13 Thread Charlie Hull

On 12/07/2018 10:28, Charlie Hull wrote:

Hi all,

A couple of years ago I ran two free Lucene Hackdays in London and 
Boston (the latter just before Lucene Revolution). Here's what we got up 
to with the kind support of Alfresco, Bloomberg, BA Insight and 
Lucidworks 
http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/


I'd like to do this again during the weeks of 8th and 15th October in 
London and Montreal (so just before the Activate event). It's a great 
chance to get together IRL with other Lucene/Solr/Elasticsearch hackers! 
I have a venue for London but a sponsor for evening curry/drinks would 
be wonderful, and for Montreal I still need a venue and evening sponsor 
- do let me know if you or your employer can help.


We have a placeholder event for London!
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/

C


I'll post again once there are more details and with a call for ideas as 
to what we should work on.


Best

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Hackdays in October, London & Montreal

2018-07-12 Thread Charlie Hull

Hi all,

A couple of years ago I ran two free Lucene Hackdays in London and 
Boston (the latter just before Lucene Revolution). Here's what we got up 
to with the kind support of Alfresco, Bloomberg, BA Insight and 
Lucidworks 
http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/


I'd like to do this again during the weeks of 8th and 15th October in 
London and Montreal (so just before the Activate event). It's a great 
chance to get together IRL with other Lucene/Solr/Elasticsearch hackers! 
I have a venue for London but a sponsor for evening curry/drinks would 
be wonderful, and for Montreal I still need a venue and evening sponsor 
- do let me know if you or your employer can help.


I'll post again once there are more details and with a call for ideas as 
to what we should work on.


Best

Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr Issue after the DSE upgrade

2018-06-18 Thread Charlie Hull

On 17/06/2018 03:10, Umadevi Nalluri wrote:

I am getting Connection refused (Connection refused) when I am running
reload_core with dsetool after we set up JMX. This issue has been happening
since the DSE upgrade to 5.0.12. Can someone please help with this issue?
Is this a bug? Is there a workaround for it?


dsetool appears to be a utility from Datastax - have you tried asking 
them for support?


Charlie


Thanks
Kantheti




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Parent product show in search result

2018-06-05 Thread Charlie Hull

On 04/06/2018 17:15, Apurba Hazra wrote:

Hello,

We are implementing Solr search for our website using Magento.

Our requirement is: on the search results page we have to show only the
parent product, not all child products, if the parent exists; otherwise we
have to show the child products.

Will you please tell us how we can do that? Should we change settings in the
Solr panel as well as the Magento admin panel?

Please advise us, it's very urgent.


Hi,

How and more importantly *if* you can do this will depend on how Solr 
has been integrated with Magento. Magento documentation, mailing lists 
etc. should be your first port of call.


Best

Charlie



*Thanks & Regards,*
*Apurba Hazra*

*Project Manager*

*Navigator Software Pvt. Ltd.*
Web Applications /  Enterprise Mobility & Mobile Apps / Cloud Solutions /
E-Commerce / Bespoke and Product development / Enterprise CMS / Online POS /
VOIP Solutions / Internet Marketing / Business Intelligence & Analytics /
Dedicated Hiring Solutions.

www.needdevelopers.com
www.boostmysale.com
www.navsoft.in

20 Dr. E Moses Road, Mahalakshmi, Mumbai 400020
205 & 206 Haute Street Bldg., 86A Topsia Road; Kolkata 700046
Tel: (+91-33) 40259595 <00913340259595>




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: How to create a solr collection providing as much searching flexibility as possible?

2018-04-30 Thread Charlie Hull

On 29/04/2018 22:25, Raymond Xie wrote:

Thank you Alessandro,

It looks like my requirement is vague, but indeed I already indicated my
data is in FIX format, which is a  format, here is an example in
the Wiki link in my original question:

8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS |
52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E |
55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY
TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |

As the data format is quite special, and commonly used in Financial area
(especially for trading data), I believe there must have been lots of
studies already made. That's why I want to find out.


Hi,

Start with the search functionality you want to provide: which fields 
should be covered by a standard search box; which fields should the user 
be able to facet on; which should they be able to sort on. From these 
requirements you should be able to work backwards and decide how to 
index the data appropriately.
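As a starting point, FIX messages like the sample above are flat tag=value pairs, so splitting them into fields before indexing is straightforward; a minimal sketch (the mapping to named Solr fields at the end is my own illustration, not a standard):

```python
# Sketch: split a FIX message (tag=value pairs, "|"-delimited in the sample)
# into a dict of raw fields ready for mapping onto a Solr document.
def parse_fix(message, delimiter=" | "):
    fields = {}
    for pair in message.strip(" |").split(delimiter):
        tag, _, value = pair.strip().partition("=")
        if tag:
            fields[tag] = value
    return fields

sample = ("8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | "
          "52=20071123-05:30:00.000 | 55=MSFT | 167=CS | 54=1")
doc = parse_fix(sample)

# Map a few well-known FIX tags to friendlier (hypothetical) Solr field names.
solr_doc = {"sender": doc.get("49"), "symbol": doc.get("55")}
```

From here you can decide which of those fields to make searchable, facetable or sortable, per the advice above.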


Cheers

Charlie



Thank you.




**
*Sincerely yours,*


*Raymond*

On Sat, Apr 28, 2018 at 11:32 AM, Alessandro Benedetti <a.benede...@sease.io

wrote:



Hi Raymond,
your requirements are quite vague, Solr offers you those capabilities but
you need to model your configuration and data accordingly.

https://lucene.apache.org/solr/guide/7_3/solr-tutorial.html
is a good starting point.
After that you can study your requirements and design the search solution
accordingly.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html








Re: IndexFetcher cannot download index file

2018-04-24 Thread Charlie Hull

On 24/04/2018 16:44, Walter Underwood wrote:

In Ultraseek, we checked free disk space before starting a merge or 
replication. If there wasn’t enough space, it emailed an error to the admin and 
disabled merging or replication, respectively.

Checking free disk space on Windows was a pain.


On a related topic, we built something that can block connections if 
there's no space to accept new documents for indexing:

https://github.com/flaxsearch/harahachibu
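The underlying check is simple; a minimal sketch of the free-space guard idea (the path and threshold are arbitrary examples, not harahachibu's actual configuration):

```python
# Sketch: refuse new indexing work when free disk space drops below a
# threshold. A caller would check this before accepting a batch.
import shutil

def has_space_for_indexing(path="/", min_free_bytes=1 << 30):  # 1 GiB example
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes

ok = has_space_for_indexing(min_free_bytes=1)  # 1 byte: almost always True
```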

Cheers

Charlie


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 24, 2018, at 8:39 AM, Shawn Heisey <elyog...@elyograg.org> wrote:

On 4/24/2018 6:52 AM, Markus Jelsma wrote:

Forget about it, recovery got a java.io.IOException: No space left on device 
but it wasn't clear until I inspected the real logs.

The logs in the web admin didn't show the disk space exception, even when I 
expand the log line. Maybe that could be changed.


What was the severity of the log entry showing the disk space exception?  Can 
you share the whole message/stacktrace?

If it doesn't show up in the admin UI logging tab, that would suggest that it 
was an INFO level log, when it should probably be ERROR.

Thanks,
Shawn









Re: Specialized Solr Application

2018-04-17 Thread Charlie Hull

On 16/04/2018 19:48, Terry Steichen wrote:

I have from time-to-time posted questions to this list (and received
very prompt and helpful responses).  But it seems that many of you are
operating in a very different space from me.  The problems (and
lessons-learned) which I encounter are often very different from those
that are reflected in exchanges with most other participants.


Hi Terry,

Sounds like a fascinating use case. We have some similar clients - small 
scale law firms and publishers - who have taken advantage of Solr.


One thing I would encourage you to do is to blog and/or talk about what 
you've built. Lucene Revolution is worth applying to talk at and if you 
do manage to get accepted - or if you go anyway - you'll meet lots of 
others with similar challenges and come away with a huge amount of 
useful information and contacts. Otherwise there are lots of smaller 
Meetup events (we run the London, UK one).


Don't assume just because some people here are describing their 350 
billion document learning-to-rank clustered monster that the small 
applications don't matter - they really do, and the fact that they're 
possible to build at all is a testament to the open source model and how 
we share information and tips.


Cheers

Charlie


So I thought it would be useful to describe what I'm about, and see if
there are others out there with similar implementations (or interest in
moving in that direction).  A sort of pay-forward.

We (the Lakota Peoples Law Office) are a small public interest, pro bono
law firm actively engaged in defending Native American North Dakota
Water Protector clients against (ridiculously excessive) criminal charges.

I have a small Solr (6.6.0) implementation - just one shard.  I'm using
the cloud mode mainly to be able to implement access controls.  The
server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 8GB of
RAM and 4 cpu processors.  We presently have 8 collections with a total
of about 60,000 documents, mostly pdfs and emails.  The indexed
documents are partly our own files and partly those we obtain through
legal discovery (which, surprisingly, is allowed in ND for criminal
cases).  We only have a few users (our lawyers and a couple of
researchers mostly), so traffic is minimal.  However, there's a premium
on precision (and recall) in searches.

The document repository is local to the server.  I piggyback on the
embedded Jetty httpd in order to serve files (selected from the
hitlists).  I just use a symbolic link to tie the repository to
Solr/Jetty's "webapp" subdirectory.

We provide remote access via ssh with port forwarding.  It provides very
snappy performance, with fully encrypted links.  Appears quite stable.

I've had some bizarre behavior apparently caused by an interaction
between repository permissions, solr permissions and the ssh link.  I
seem "solved" for the moment, but time will tell for how long.

If there are any folks out there who have similar requirements, I'd be
more than happy to share the insights I've gained and problems I've
encountered and (I think) overcome.  There are so many unique parts of
this small scale, specialized application (many dimensions of which are
not strictly internal to Solr) that it probably won't be appreciated to
dump them on this (excellent) Solr list.  So, if you encounter problems
peculiar to this kind of setup, we can perhaps help handle them off-list
(although if they have more general Solr application, we should, of
course, post them to the list).

Terry Steichen




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web
service https://github.com/mattflax/dropwizard-tika-server written by a
colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca>
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and the
> link to Erick's blog post whenever Tika is used. 
>
>
>
>
>
> -Original Message-
>
> From: Charlie Hull [mailto:char...@flax.co.uk]
>
> Sent: Monday, April 9, 2018 12:44 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
>
>
> I'd recommend you run Tika externally to Solr, which will allow you to
> catch this kind of problem and prevent it bringing down your Solr
> installation.
>
>
>
> Cheers
>
>
>
> Charlie
>
>
>
> On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca>
>
> wrote:
>
>
>
> > Hello!
>
> >
>
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>
> > we have in our Sharepoint system. I have used the tika-app.jar
>
> > directly to extract the document in question and it does _not_ throw
>
> > an exception and extract the contents just fine. So it would seem Solr
>
> > is doing something different than a Tika standalone installation.
>
> >
>
> > After some Googling, I found out that Solr uses its custom HtmlMapper
>
> > (MostlyPassthroughHtmlMapper) which passes through all elements in the
>
> > HTML document to Tika. As Tika limits nested elements to 100, this
>
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>
> XML element nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091#comment-15514131). The
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The
>
> > "solution" is to use Tika's default parsing/mapping mechanism but no
>
> > details have been provided on how to configure this at Solr.
>
> >
>
> > I'm hoping some folks here have the knowledge on how to configure Solr
>
> > to effectively by-pass its built in MostlyPassthroughHtmlMapper and
>
> > use Tika's implementation.
>
> >
>
> > Thank you!
>
> > Harinder
>
> >
>
> >
>
> > 
>
> > NOTICE -
>
> > This communication is intended ONLY for the use of the person or
>
> > entity named above and may contain information that is confidential or
>
> > legally privileged. If you are not the intended recipient named above
>
> > or a person responsible for delivering messages or communications to
>
> > the intended recipient, YOU ARE HEREBY NOTIFIED that any use,
>
> > distribution, or copying of this communication or any of the
>
> > information contained in it is strictly prohibited. If you have
>
> > received this communication in error, please notify us immediately by
>
> > telephone and then destroy or delete this communication, or return it
>
> > to us by mail if requested by us. The City of Calgary thanks you for
> your attention and co-operation.
>
> >
>
>


Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to
catch this kind of problem and prevent it bringing down your Solr
installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder 
wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our Sharepoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extract the contents just fine. So it would seem Solr is doing something
> different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is mentioned in TIKA-2091 (https://issues.apache.org/
> jira/browse/TIKA-2091#comment-15514131). The
> "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this at Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr to
> effectively by-pass its built in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder
>
>
> 
>


Re: Query redg : diacritics in keyword search

2018-03-29 Thread Charlie Hull

On 29/03/2018 14:12, Peter Lancaster wrote:

Hi,

You don't say whether the AsciiFolding filter is at index time or query time. 
In any case you can easily look at what's happening using the admin analysis 
tool which helpfully will even highlight where the analysed query and index 
token match.

That said I'd expect what you want to work if you simply use the ASCIIFoldingFilterFactory on both index and query.


Simply put:

You use the filter at indexing time to collapse any variants of a term 
into a single variant, which is then stored in your index.


You use the filter at query time to collapse any variants of a term that 
users type into a single variant, and if this exists in your index you 
get a match.


If you don't use the same filter at both ends you won't get a match.
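For illustration only, here is roughly what such a folding step does (Solr's ASCIIFoldingFilterFactory is Java and handles far more cases than combining marks — this is just to show why both sides must be folded):

```python
# Rough illustration of accent folding: decompose characters (NFD) and
# drop the combining marks, so "Carré" and "Carre" become the same token.
import unicodedata

def fold(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

# Applied at both index and query time, the variants meet in the middle:
assert fold("Carré") == fold("Carre") == "Carre"
```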

Cheers

Charlie



Cheers,
Peter.

-Original Message-
From: Paul, Lulu [mailto:lulu.p...@bl.uk]
Sent: 29 March 2018 12:03
To: solr-user@lucene.apache.org
Subject: Query redg : diacritics in keyword search

Hi,

The keyword search Carré returns values Carré and Carre (this works well, as I added the 
ASCIIFolding filter in the 
schema config to enable returning of both sets of values)

Now it looks like we want Carre to return both Carré and Carre (and this doesn't 
work; Solr only returns Carre) – any ideas on how this scenario can be achieved?

Thanks & Best Regards,
Lulu Paul



**
Experience the British Library online at www.bl.uk<http://www.bl.uk/> The British 
Library’s latest Annual Report and Accounts : 
www.bl.uk/aboutus/annrep/index.html<http://www.bl.uk/aboutus/annrep/index.html>
Help the British Library conserve the world's knowledge. Adopt a Book. 
www.bl.uk/adoptabook<http://www.bl.uk/adoptabook>
The Library's St Pancras site is WiFi - enabled
*
The information contained in this e-mail is confidential and may be legally 
privileged. It is intended for the addressee(s) only. If you are not the intended 
recipient, please delete this e-mail and notify the 
postmas...@bl.uk<mailto:postmas...@bl.uk> : The contents of this e-mail must 
not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author 
and do not necessarily reflect those of the British Library. The British 
Library does not take any responsibility for the views of the author.
*
Think before you print


This message is confidential and may contain privileged information. You should 
not disclose its contents to any other person. If you are not the intended 
recipient, please notify the sender named above immediately. It is expressly 
declared that this e-mail does not constitute nor form part of a contract or 
unilateral obligation. Opinions, conclusions and other information in this 
message that do not relate to the official business of findmypast shall be 
understood as neither given nor endorsed by it.


__

This email has been checked for virus and other malicious content prior to 
leaving our network.
______






Re: Solr or Elasticsearch

2018-03-22 Thread Charlie Hull

On 22/03/2018 13:13, Steven White wrote:

Hi everyone,

There are some good write ups on the internet comparing the two and the one
thing that keeps coming up about Elasticsearch being superior to Solr is
it's analytic capability.  However, I cannot find what those analytic
capabilities are and why they cannot be done using Solr.  Can someone help
me with this question?


Hi Steve,

As you've said there are lots of writeups, some more out-of-date than 
others. http://solr-vs-elasticsearch.com/ is quite good on features.


The analytics in ES are based on a number of custom aggregations (which 
I always think of as facet-counting-on-steroids, but I realise it's more 
complicated than that). Here's an early doc on them 
https://www.elastic.co/guide/en/elasticsearch/guide/current/_analytics.html 
So you need a good grasp of Elasticsearch's DSL to use these. The 
integration with Kibana is good if you want to display your results.


Solr's analytic capabilities use a Solr Search Component: 
https://lucene.apache.org/solr/guide/7_2/analytics.html . As with a lot 
of Solr features these can appear a lot more complex than 
Elasticsearch's offering. Yonik's blog is also worth reading as he often 
talks about new and upcoming Solr features like this. 
http://yonik.com/solr-facet-functions/


As we've always said, there are few cases where you can't build a 
solution using either engine and I believe that's also true for analytics.


Personally, I'm a Solr user and the thing that concerns me about
Elasticsearch is the fact that it is owned by a company that can  any day
decide to stop making Elasticsearch avaialble under Apache license and even
completely close free access to it.


Yes, but why would they? It would be suicide for a company that have 
such an established open source heritage - not least because a lot of 
Lucene developers who work for Elastic would object. I'd be a bit more 
annoyed about the fact they announced that their commercial XPack 
add-ons would be 'open code' and everyone thinks that means 'open 
source' - which it clearly isn't.


So, this is a 2 part question:

1) What are the analytic capability of Elasticsearch that cannot be done
using Solr?  I want to see a complete list if possible.
2) Should an Elasticsearch user be worried that Elasticsearch may close
it's open-source policy at anytime or that outsiders have no say about it's
road map?


That's a slightly different question about road map - but you do have 
some say, Elastic's developers have always been very helpful and open to 
suggestions from outsiders (who are also users of course!).


Cheers

Charlie


Thanks,

Steve






Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull

On 07/03/2018 13:29, lala wrote:

Thanks Charlie...
It's just confusing for me. In the DIH configuration file, the inner entity
that takes "TikaEntityProcessor" as its processor, I can easily specify a
tikaConfig attribute to an xml file, located inside the config folder in the
core, and where in this file I should be able to override the PDFParser
default properties... As in parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder,
set the "tikaConfig" attribute to "tika-config.xml"... but Tika is still not parsing
images inside the PDF file!!!
Let's say this is just experimenting with Solr DIH crawling... Why is it not
working?

This is my tika-config.xml file:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the xml file from the config folder, extract the params and override the original
PDFParser attributes. But it DOESN'T!
Any idea??


Hi,

My reading of 
https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file 
indicates that your PDF parser may not run unless you explicitly exclude 
PDFs, which I don't think you're doing above.


I'm not an expert on Tika configuration, but I think you should first 
try this xml file with standalone Tika and see if it does what you think 
it should. Once you're sure, then try it with DIH or SolrJ.


Cheers

Charlie










Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull

On 07/03/2018 09:32, lala wrote:

Thanks for your reply Erick,

Actually I am using Solrj to index files among other operations with Solr,
but to index a large amount of different kinds of files, I'm sending a DIH
request to Solr using the SolrJ API: FileListEntityProcessor with
TikaEntityProcessor...
Why not benefit from this technology if Solr offers it? It simplifies our
work tremendously...


It may simplify your work, but it isn't good practice. Tika has some 
heavy lifting to do to extract text from some formats and you should 
consider how this load will affect Solr. We've often put Tika into a 
different process for this reason.



Isn't there any way to be able to extract inline images in PDF docs??


https://stackoverflow.com/questions/31303735/how-to-extract-images-from-a-file-using-apache-tika 
has some useful suggestions.


Charlie


Awaiting your reply, best regards...









Re: Word / PDF document snippet rendering in search

2018-03-02 Thread Charlie Hull

On 02/03/2018 00:15, T Wild wrote:

I'm interested in building a software system which will connect to various
document sources, extract the content from the documents contained within
each source, and make the extracted content available to a search engine
such Solr. This search engine will serve as the back-end for a web-based
search application.

This is basically an 'enterprise search' system. You use 'connectors' to 
get text out of the source documents - in Solr applications we often use 
Apache Tika to extract text from common formats like Office or PDF, 
Apache ManifoldCF is another useful project for connecting to repositories.




I'm interested in rendering snippets of these documents in the search
results for well-known types, such as Microsoft Word and PDF. How would one
go about implementing document snippet rendering in search?


If you just want the snippets as text, you can use Solr highlighters 
which can provide contextual snippets (i.e. chunks of text around the 
query matches).
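As a toy illustration of what a contextual snippet is (not Solr's actual highlighter, which is far more sophisticated):

```python
# Toy snippet generator: grab a window of text around the first query match
# and wrap the matched term in <em> tags, as a highlighter would.
def snippet(text, term, window=30):
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""
    start = max(0, pos - window)
    end = min(len(text), pos + len(term) + window)
    match = text[pos:pos + len(term)]
    return (text[start:pos] + "<em>" + match + "</em>"
            + text[pos + len(term):end])

s = snippet("The quick brown fox jumps over the lazy dog", "fox", window=10)
```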


I'd be happy with serving up these snippets in any format, including as
images. I just want to be able to give my users some kind of formatted
preview of their results for well-known types.


If you however want to show bits of the original documents that's more 
difficult. You'll need to store a reference to the original document in 
Solr and use an external system to display it - you'll need specific 
systems for different doc types: PDFs can be shown in various browser 
plugins for example. Another approach is illustrated in this open source 
code we wrote a while ago - it uses OpenOffice in 'headless' mode to 
provide images of the source document:

https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen

Hope this helps!

Cheers

Charlie


Thank you!






Re: DocValues and in-place updates

2018-02-12 Thread Charlie Hull

On 12/02/2018 16:02, Brian Yee wrote:

I asked a question here about fast inventory updates last week and I was 
recommended to use docValues with partial in-place updates. I think this will 
work well, but there is a problem I can't think of a good solution for.

Consider this scenario:
InStock = 1 for a product.
InStock changes to 0 which triggers a fast in-place update with docValues.
But it also triggers a slow update that will rebuild the entire document. Let's 
say that takes 10 minutes because we do updates in batches.
During those 10 minutes, InStock changes again to 1, which triggers a fast update 
to solr. So in Solr InStock=1 which is correct.
The slow update finishes and overwrites InStock=0 which is incorrect.

How can we deal with this situation?

It's a slightly crazy idea, but in the past we've solved a similar 
problem by building a custom Lucene codec that is backed by a Redis 
database. You change the stock value in Redis and Lucene doesn't 
actually notice and re-index.

http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/

Not sure if this is a better way than DocValues, it was quite a while 
ago and Lucene has moved on a bit since then
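Another common guard for this race is optimistic concurrency: stamp each fast update with a timestamp and skip any slow rebuild whose snapshot is older. A sketch with invented names, not a Solr API:

```python
# Sketch: remember when each fast in-place update happened, and reject a
# slow full rebuild if a newer fast update has landed since its snapshot.
last_fast_update = {}  # product_id -> timestamp of latest fast update

def apply_fast_update(index, product_id, in_stock, ts):
    index[product_id]["InStock"] = in_stock
    last_fast_update[product_id] = ts

def apply_slow_rebuild(index, product_id, doc, snapshot_ts):
    if last_fast_update.get(product_id, 0) > snapshot_ts:
        return False  # stale: a fast update arrived after the snapshot
    index[product_id] = doc
    return True

index = {"p1": {"InStock": 1}}
apply_fast_update(index, "p1", 0, ts=100)   # fast: out of stock
apply_fast_update(index, "p1", 1, ts=200)   # fast: back in stock
applied = apply_slow_rebuild(index, "p1", {"InStock": 0}, snapshot_ts=100)
# the stale rebuild is rejected and InStock stays 1
```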


Cheers

Charlie



Re: Purchase of support

2018-02-12 Thread Charlie Hull

On 12/02/2018 07:58, Hon Fook Boey wrote:

Hi,

May I know if support/maintenance can be purchased for Solr?

Hi,

Various companies provide support for Solr (including us): what kind of 
support are you looking for?


Best

Charlie


Thanks and regards,

Boey HF

eHoB Technology Sdn Bhd
(Co Reg No 561898-X, GST Reg # 001282277376)
No 12-2, Jln PJU 7/16A, Mutiara Damansara, 47800 Petaling Jaya, Malaysia
Tel +6 03 7710 3308 Fax +6 03 7726 6228 Mobile +6 012 395 0213 WWW 
www.ehob-tech.com.my







Re: Opinions on ExtractingRequestHandler

2018-02-08 Thread Charlie Hull

On 08/02/2018 11:47, Frederik Van Hoyweghen wrote:

Hey everyone,

What are your experiences on making (in production) use of Solr's
ExtractingRequestHandler?

I've been reading some mixed remarks so I was wondering what your actual
experiences with it are.

Personally, I feel like setting up a separate service which is solely
responsible for parsing file contents (to be indexed by Solr later on in
the process) using Tika is a safer approach, so we can use whatever Tika
version we want along with other things we might want to add.


Yes, do this. It's entirely possible to bring down Tika with a nasty 
PDF, or end up consuming lots of resources in the extraction step and 
have these impact your Solr server. Run it separately and you can 
monitor it/kill it if necessary.
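The isolation idea can be sketched like this: run the extraction as a separate OS process with a timeout so a runaway parse can be killed (the command here is a stand-in; a real setup would invoke something like the tika-app jar):

```python
# Sketch: isolate extraction in a child process with a timeout, so a
# pathological document can be killed without taking the indexer down.
import subprocess

def safe_extract(cmd, timeout=10):
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
        return result.stdout if result.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None  # runaway parse was killed when the timeout expired

# Stand-in command; a real call might be: ["java", "-jar", "tika-app.jar", ...]
text = safe_extract(["echo", "extracted text"], timeout=5)
```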


You might like my colleague Matt Pearce's DropWizard wrapper for Tika 
https://github.com/mattflax/dropwizard-tika-server


Cheers

Charlie


Looking forward to your response!

Kind regards,
Frederik






Re: Relevancy Tuning For Solr With Apache Nutch 2.3

2018-02-08 Thread Charlie Hull

On 07/02/2018 21:59, Mukhopadhyay, Aratrika wrote:

Hello,
  I am attempting to tune my results that I retrieve from solr to boost 
the importance of certain fields. The syntax of the query I am using is as 
follows :
http://localhost:8983/solr/housegov_data/select?indent=on&q=QUERY&defType=edismax&qf=FIELD1^20.0_FIELD2^0.03&wt=json.
The issue is that this is not boosting anything in most cases, or it isn't able to find any documents that 
match the criteria. I have used Nutch to crawl websites and indexed the data to Solr. I see that Nutch applies an 
index-time boost as well. Could that have something to do with this? Can anyone look at the format of this query and 
point out any mistakes that I am making?


Hi,

- You seem to have two fields incorrectly concatenated with an 
underscore: qf=FIELD1^20.0_FIELD2^0.03 - this should be a space or an 
encoded space
- a large boost of 20 combined with a fractional boost of 0.03 worries 
me as it implies that one field is 666 times as important as another, 
are you sure this is the case?
- you should turn off *all* the boosts, including the Nutch one, and 
start again, *gently* applying boosts where you can *prove* they improve 
relevancy
- you should consider using a tool such as Quepid (disclaimer: we resell 
this, but there's a free trial period you can use) for relevancy tuning 
based on a set of test cases
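For example, a corrected and properly encoded version of the query might be built like this (field names and boosts are placeholders; note the space between the qf entries):

```python
# Build a URL-encoded edismax query with space-separated qf boosts.
from urllib.parse import urlencode

params = {
    "q": "QUERY",
    "defType": "edismax",
    "qf": "FIELD1^2.0 FIELD2^1.0",  # space-separated fields, gentler boosts
    "wt": "json",
}
query_string = urlencode(params)  # encodes "^" as %5E and the space as +
url = "http://localhost:8983/solr/housegov_data/select?" + query_string
```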


HTH,

Charlie




FYI : I am using a data driven schema.
Regards,
Aratrika Mukhopadhyay






Re: External file fields

2018-02-02 Thread Charlie Hull

On 01/02/2018 18:55, Brian Yee wrote:

Hello,

I want to use external file field to store frequently changing inventory and 
price data. I got a proof of concept working with a mock text file and this 
will suit my needs.

What is the best way to keep this file updated in a fast way. Ideally I would 
like to read changes from a Kafka queue and write to the file. But it seems 
like I would have to open the whole file, read the whole file, find the line I 
want to change, and write the whole file for every change. Is there a better 
way to do that? That approach seems like it would be difficult/slow if the file 
is several million lines long.

Also, once I come up with a way to update the file quickly, what is the best 
way to distribute the file to all the different solrcloud nodes in the correct 
directory?

Another approach would be the XJoin plugin we wrote - if you wait a few 
days we should have an updated patch for Solr v6.5 and possibly v7. 
XJoin lets you filter/join/rank Solr results using an external data source.


http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/
http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/
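If you do stay with the flat-file approach, the whole-file rewrite is less painful than it sounds: read the lines into a dict, apply the batch of changes, write a temp file and atomically rename it over the original. A sketch (the file name and keys are invented):

```python
# Sketch: apply a batch of key=value deltas (e.g. drained from a Kafka
# queue) to an external-file-field style file via an atomic replace.
import os
import tempfile

def apply_updates(path, updates):
    data = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                key, _, value = line.strip().partition("=")
                if key:
                    data[key] = value
    data.update(updates)
    # Write to a temp file, then atomically swap it into place, so readers
    # always see either the complete old file or the complete new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        for key, value in sorted(data.items()):
            f.write(f"{key}={value}\n")
    os.replace(tmp, path)

path = os.path.join(tempfile.gettempdir(), "inventory_eff.txt")
with open(path, "w") as f:
    f.write("sku1=10\nsku2=0\n")
apply_updates(path, {"sku2": "5", "sku3": "7"})  # one batched set of deltas
contents = open(path).read()
```

Batching the deltas keeps the number of rewrites manageable even for a multi-million-line file; distributing the result to the nodes is a separate problem.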


Cheers

Charlie




Re: Adding virtual host in Jetty (Solr deployed)

2018-02-01 Thread Charlie Hull

On 01/02/2018 12:40, solr2020 wrote:

Hi,

We have installed solr which is running in jetty 9x version. We are trying
to change the default solr url to required URL as given below.

Default url: http://localhost:8983/solr

Required URL :http://test.com/solr

To achieve this we are trying to configure virtual host in jetty
(solr-jetty-context.xml) with the below jetty documentation reference
(https://wiki.eclipse.org/Jetty/Howto/Configure_Virtual_Hosts). But it is
not working.

You're going to need to give more details I'm afraid, such as exactly 
what you expect it to do and what happens when you test it.


Cheers

Charlie










Re: Distributed search cross cluster

2018-01-31 Thread Charlie Hull

On 30/01/2018 16:09, Jan Høydahl wrote:

Hi,

A customer has 10 separate SolrCloud clusters, with same schema across all, but 
different content.
Now they want users in each location to be able to federate a search across all 
locations.
Each location is 100% independent, with separate ZK etc. Bandwidth and latency 
between the
clusters is not an issue, they are actually in the same physical datacenter.

Now my first thought was using a custom shards parameter, and let the 
receiving node fan
out to all shards of all clusters. We’d need to contact the ZK for each 
environment and find
all shards and replicas participating in the collection and then construct the 
shards=A1|A2,B1|B2…
string which would be quite big, but if we get it right, it should “just work".

Now, my question is whether there are other smarter ways that would leave it up 
to existing Solr
logic to select shards and load balance, that would also take into account any 
shard.keys/_route_
info etc. I thought of these
   * collection=collA,collB  — but it only supports collections local to one 
cloud
   * Create a collection ALIAS to point to all 10 — but same here, only local 
to one cluster
   * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want it 
with the pure search API
   * Write a custom ShardHandler plugin that knows about all clusters — but 
this is complex stuff :)
   * Write a custom SearchComponent plugin that knows about all clusters and adds 
the shards= param

Another approach would be for the originating cluster to fan out just ONE 
request to each of the other
clusters and then write some SearchComponent to merge those responses. That 
would let us query
the other clusters using one LB IP address instead of requiring full visibility 
to all solr nodes
of all clusters, but if we don’t need that isolation, that extra merge code 
seems fairly complex.

So far I opt for the custom SearchComponent and shards= param approach. Any 
useful input from
someone who tried a similar approach would be priceless!


Hi Jan,

We actually looked at this for the BioSolr project - a SolrCloud of 
SolrClouds. Unfortunately the funding didn't appear for the project so 
we didn't take it any further than some rough ideas - as you say, if you 
get it right it should 'just work'. We had some extra complications in 
terms of shared partial schemas...


Cheers

Charlie
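
As a rough sketch of the fan-out idea above, the big shards= value can 
be assembled mechanically from each cluster's shard/replica lists (the 
cluster data below is invented for illustration; in practice it would 
be read from each cluster's ZK ensemble):

```python
def build_shards_param(clusters):
    """Assemble a Solr shards= value from per-cluster shard/replica lists.

    clusters: a list of clusters; each cluster is a list of shards; each
    shard is a list of replica base URLs. Replicas of the same shard are
    joined with '|' (Solr load-balances across them per request), and
    shards are joined with ','.
    """
    return ",".join(
        "|".join(replicas)
        for cluster in clusters
        for replicas in cluster
    )

# Two clusters (A and B), each with one shard and two replicas --
# the URLs are invented for illustration only.
clusters = [
    [["http://a1:8983/solr/coll", "http://a2:8983/solr/coll"]],
    [["http://b1:8983/solr/coll", "http://b2:8983/solr/coll"]],
]
print(build_shards_param(clusters))
# → http://a1:8983/solr/coll|http://a2:8983/solr/coll,http://b1:8983/solr/coll|http://b2:8983/solr/coll
```

The resulting string is passed as the shards= request parameter to any 
node in the receiving cluster, which then fans the query out and merges 
the results.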


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Using Solr with SharePoint Online

2018-01-30 Thread Charlie Hull

On 30/01/2018 07:57, Mohammed.Adnan2 wrote:

Hello Team,

I am a beginner learning Apache Solr. I am trying to check the compatibility of 
Solr with SharePoint Online, but I am not finding anything concrete about 
this in the website documentation. Can you please help me by providing some 
information on this? How can I index my SharePoint content with Solr and then 
use Solr on my SharePoint sites? I really appreciate your help on this.

Thanks,
Adnan


Hi Adnan,

There are various things you need to consider:
1. Why do you need Solr at all - Sharepoint Online has its own built-in 
search engine.
2. Installing Solr on a Windows server with access to Sharepoint Online 
- shouldn't be a huge problem, Solr is a Java application so you'll also 
need Java installed. You might want to run Solr as a Windows Service so 
it's always there in the background - look up NSSM.
3. You need a way to get the content out of Sharepoint and into Solr. 
The best way to do this will be to crawl the Sharepoint site. There are 
some commercially available connectors from BA Insight and Lucidworks or 
you'll have to roll your own. This https://github.com/golincode/SPOC 
might be a good starting point. If you go this route you'll certainly 
need to condition the data before you index it with Solr, so you'll have 
to understand how Solr schemas, analyzers etc. work.
4. Then you'll need a UI to talk to Solr to carry out queries - if this 
is to live within the Sharepoint world you'll need to write a web 
application compatible with Sharepoint.
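
To illustrate step 3, the conditioning stage often boils down to mapping 
crawled items onto your Solr schema before POSTing them to the update 
endpoint. A minimal sketch (the SharePoint keys and Solr field names 
here are assumptions for illustration, as is the "mycore" core name — 
map them to your actual crawl output and schema):

```python
import json

def sharepoint_items_to_solr_docs(items):
    """Condition crawled SharePoint items into Solr JSON documents.

    The SharePoint keys (UniqueId, Title, Url, Body) and the Solr
    field names (title_t, url_s, content_txt) are illustrative
    assumptions, not a fixed mapping.
    """
    docs = []
    for item in items:
        docs.append({
            "id": item["UniqueId"],           # Solr's uniqueKey field
            "title_t": item.get("Title", ""),
            "url_s": item.get("Url", ""),
            "content_txt": item.get("Body", ""),
        })
    return docs

# One crawled item, serialised for POSTing as JSON to Solr's update
# endpoint, e.g. http://localhost:8983/solr/mycore/update?commit=true
# with Content-Type: application/json (the POST itself is not done here).
items = [{"UniqueId": "abc-123", "Title": "Welcome",
          "Url": "https://example.sharepoint.com/welcome", "Body": "Hello"}]
payload = json.dumps(sharepoint_items_to_solr_docs(items))
print(payload)
```
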


HTH,

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


  1   2   3   >