Solr 8.2 - Added Field - can't facet using alias

2019-10-11 Thread Joe Obernberger

Hi All, I've added a field with:

curl -X POST -H 'Content-type:application/json' --data-binary 
'{"add-field":{"name":"FaceCluster","type":"plongs","stored":false,"multiValued":true,"indexed":true}}' 
http://miranda:9100/solr/UNCLASS_2019_8_5_36/schema


It returned success.  In the UI, when I examine the schema, the field 
shows up, but it does not list 'Schema' with the check-boxes for 
Indexed/DocValues etc.; it only lists 'Properties' for FaceCluster.  
Other plong fields that were added a while back show both Properties 
and Schema.
When I try to facet on this field using an alias, I get 'Error from 
server at null: undefined field: FaceCluster'.  If I search an 
individual Solr collection directly, I can facet on it.
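
For reference, here's a minimal SolrJ sketch of the failing request (the 
alias name 'UNCLASS' is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetViaAlias {
    public static void main(String[] args) throws Exception {
        // Point the client at the alias rather than a single collection.
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://miranda:9100/solr/UNCLASS").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.addFacetField("FaceCluster"); // fails: undefined field FaceCluster
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getFacetField("FaceCluster").getValues());
        }
    }
}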


Any ideas?

-Joe



RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Davis, Daniel (NIH/NLM) [C]
Nuance and ABBYY provide OCR capabilities as well.

Looking at higher-level solutions, both indexengines.com and Commvault can do 
email remediation for legal issues.
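
For an open-source route, Apache Tika (which Solr's extraction handler 
uses under the hood) can recurse into EML/MSG attachments, and its 
Tesseract integration will OCR embedded images when the tesseract binary 
is installed.  A minimal sketch, assuming tika-parsers is on the classpath:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class MailAttachmentText {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Registering the parser in its own context makes Tika recurse
        // into embedded documents (PDF/image attachments inside EML/MSG).
        context.set(Parser.class, parser);
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString()); // body text plus attachment text
    }
}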

> -----Original Message-----
> From: Retro 
> Sent: Friday, October 11, 2019 8:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Tesseract OCR to extract PDF files in EML file attachment
> 
> AJ Weber wrote
> > There are alternative, paid, libraries to parse and extract attachments
> > from EML files as well.
> > EML attachments will have a mimetype associated with their metadata.
> 
> Hello, can you give a hint as to which commercial libraries would do
> the job? We need to index MSG files and OCR the attachments within them.
> Tesseract cannot do this, and I'm having a hard time finding a solution.
> Thank you!


Re: Solr-8.2.0 Cannot create collection on CentOS 7.7

2019-10-11 Thread Shawn Heisey

On 10/10/2019 11:01 PM, Peter Davie wrote:
I have just installed Solr 8.2.0 on CentOS 7.7.1908.   Java version is 
as follows:


openjdk version "11.0.4" 2019-07-16 LTS




Caused by: java.time.format.DateTimeParseException: Text 
'2019-10-11T04:46:03.971Z' could not be parsed: null




Note that I have tested this and it is working on Windows 10 with Solr 
8.2.0 using the following Java version:


openjdk version "11.0.2" 2019-01-15


This is looking like either a bug in Java or a change in the Java API. 
The code being called here by the core initialization routines is all in 
Java -- Solr is validating that the Java date formatter it's trying to 
use can parse ISO date strings, and on the version of Java where this 
occurs, that validation is failing.
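
One way to rule Solr out is to exercise the same JDK facility directly.  
This standalone sketch (not Solr's actual code path) should throw the 
same DateTimeParseException on a broken install:

import java.time.Instant;

public class IsoParseCheck {
    public static void main(String[] args) {
        // Instant.parse() uses DateTimeFormatter.ISO_INSTANT internally.
        Instant parsed = Instant.parse("2019-10-11T04:46:03.971Z");
        System.out.println("Parsed OK: " + parsed);
    }
}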


I haven't been keeping an eye on our Jenkins tests, so I do not know 
whether we have automated testing with OpenJDK 11.0.4 ... but if we do, 
it seems like a large number of tests would be failing because of this 
... unless your install of OpenJDK has something wrong with it.


I fired up the cloud example on a fresh solr-8.2.0 download on Ubuntu 18 
with OpenJDK 11.0.4.  It created its default "gettingstarted" collection 
with no problems, and then I also created a new collection with the 
bin/solr command with no problems.  I do not have CentOS in my little lab.
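
For completeness, the equivalent check from SolrJ rather than bin/solr 
looks roughly like this (ZooKeeper address and collection name are 
assumptions):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollectionCheck {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty())
                .build()) {
            // One shard, one replica, using the _default configset.
            CollectionAdminRequest.createCollection("test", "_default", 1, 1)
                    .process(client);
            System.out.println("Collection created");
        }
    }
}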


Thanks,
Shawn


Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Retro
AJ Weber wrote
> There are alternative, paid, libraries to parse and extract attachments 
> from EML files as well.
> EML attachments will have a mimetype associated with their metadata.

Hello, can you give a hint as to which commercial libraries would do
the job? We need to index MSG files and OCR the attachments within them.
Tesseract cannot do this, and I'm having a hard time finding a solution.
Thank you!





Re: igain query parser generating invalid output

2019-10-11 Thread Joel Bernstein
This sounds like a great patch. I can help with the review and commit after
the jira is created.

Thanks!

Joel


On Fri, Oct 11, 2019 at 1:06 AM Peter Davie <
peter.da...@convergentsolutions.com.au> wrote:

> Hi,
>
> I apologise in advance for the length of this email, but I want to share
> my discovery steps to make sure that I haven't missed anything during my
> investigation...
>
> I am working on a classification project and will be using the
> classify(model()) stream function to classify documents.  I have noticed
> that models generated include many noise terms from the (lexically)
> early part of the term list.  To test, I have used the "BBC articles
> fulltext and category" dataset from Kaggle
> (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have
> indexed the data into a Solr collection (news_categories) and am
> performing the following operation to generate a model for documents
> categorised as "BUSINESS" (only keeping the 100th iteration):
>
> having(
>  train(
>  news_categories,
>  features(
>  news_categories,
>  zkHost="localhost:9983",
>  q="*:*",
>  fq="role:train",
>  fq="category:BUSINESS",
>  featureSet="business",
>  field="body",
>  outcome="positive",
>  numTerms=500
>  ),
>  fq="role:train",
>  fq="category:BUSINESS",
>  zkHost="localhost:9983",
>  name="business_model",
>  field="body",
>  outcome="positive",
>  maxIterations=100
>  ),
>  eq(iteration_i, 100)
> )
>
> The output generated includes "noise" terms, such as the following
> "1,011.15", "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07",
> "09", and these terms all have the same value for idfs_ds ("-Infinity").
>
> Investigating the "features()" output, it seems that the issue is that
> the noise terms are being returned with NaN for the score_f field:
>
>  "docs": [
>{
>  "featureSet_s": "business",
>  "score_f": "NaN",
>  "term_s": "1,011.15",
>  "idf_d": "-Infinity",
>  "index_i": 1,
>  "id": "business_1"
>},
>{
>  "featureSet_s": "business",
>  "score_f": "NaN",
>  "term_s": "10.3m",
>  "idf_d": "-Infinity",
>  "index_i": 2,
>  "id": "business_2"
>},
>{
>  "featureSet_s": "business",
>  "score_f": "NaN",
>  "term_s": "01",
>  "idf_d": "-Infinity",
>  "index_i": 3,
>  "id": "business_3"
>},
>{
>  "featureSet_s": "business",
>  "score_f": "NaN",
>  "term_s": "02",
>  "idf_d": "-Infinity",
>  "index_i": 4,
>  "id": "business_4"
>},...
>
> I have examined the code within
> org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and
> see that the scores being returned by {!igain} include NaN values, as
> follows:
>
> {
>"responseHeader":{
>  "zkConnected":true,
>  "status":0,
>  "QTime":20,
>  "params":{
>"q":"*:*",
>"distrib":"false",
>"positiveLabel":"1",
>"field":"body",
>"numTerms":"300",
>"fq":["category:BUSINESS",
>  "role:train",
>  "{!igain}"],
>"version":"2",
>"wt":"json",
>"outcome":"positive",
>"_":"1569982496170"}},
>"featuredTerms":[
>  "0","NaN",
>  "0.0051","NaN",
>  "0.01","NaN",
>  "0.02","NaN",
>  "0.03","NaN",
>
> Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it
> seems that when a term is not included in the positive or negative
> documents, the docFreq calculation (docFreq = xc + nc) is 0, which means
> that subsequent calculations result in NaN (0 divided by 0), which
> generates these meaningless values for the computed score.
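>
> Here is a tiny standalone illustration of that failure mode and the
> guard (not the actual Solr code):
>
> public class NaNDemo {
>     public static void main(String[] args) {
>         int xc = 0, nc = 0;                   // term in neither doc set
>         int docFreq = xc + nc;
>         double ratio = (double) xc / docFreq; // 0.0 / 0.0
>         System.out.println(ratio);            // prints NaN
>         if (docFreq == 0) {
>             // the patch skips such terms in finish() instead of scoring them
>             System.out.println("term skipped");
>         }
>     }
> }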
>
> I have patched a local version of Solr to skip terms for which docFreq
> is 0 in the finish() method of IGainTermsQParserPlugin and this is now
> the result:
>
> {
>"responseHeader":{
>  "zkConnected":true,
>  "status":0,
>  "QTime":260,
>  "params":{
>"q":"*:*",
>"distrib":"false",
>"positiveLabel":"1",
>"field":"body",
>"numTerms":"300",
>"fq":["category:BUSINESS",
>  "role:train",
>  "{!igain}"],
>"version":"2",
>"wt":"json",
>"outcome":"positive",
>"_":"1569983546342"}},
>"featuredTerms":[
>  "3",-0.0173133558644304,
>  "authority",-0.0173133558644304,
>  "brand",-0.0173133558644304,
>  "commission",-0.0173133558644304,
>  "compared",-0.0173133558644304,
>  "condition",-0.0173133558644304,
>  "continuing",-0.0173133558644304,
>  "deficit",-0.0173133558644304,
>  "expectation",-0.0173133558644304,
>
> To my (admittedly inexpert) eye, it seems like this is producing more