RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Code for SolrJ is going to be very dependent on your needs, but the beating 
heart of my code is below (note that I do OCR as a separate step before feeding 
files into the indexer). The SolrJ and Tika docs should help.

File f = new File(filename);
ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
if (filename.toLowerCase().contains("pdf")) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);
}
InputStream input = new FileInputStream(f);
try {
    parser.parse(input, textHandler, metadata, context);
} catch (Exception e) {
    Logger.getLogger(JsMapAdminService.class.getName())
          .log(Level.SEVERE, String.format("File %s failed", f.getCanonicalPath()), e);
    writeLog(String.format("File %s failed", f.getCanonicalPath()));
    return false;
}
SolrInputDocument up = new SolrInputDocument();
if (title == null) title = metadata.get("title");
if (author == null) author = metadata.get("author");
up.addField("id", f.getCanonicalPath());
up.addField("location", idString);
up.addField("title", title);
up.addField("author", author);
// ...addField calls for the rest of your fields
String content = textHandler.toString();
up.addField("_text_", content);
UpdateRequest req = new UpdateRequest();
req.add(up);
req.setBasicAuthCredentials("solrAdmin", password);
UpdateResponse ur = req.process(solr, "prindex");
req.commit(solr, "prindex");
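
The snippet assumes a SolrClient named solr (and a writeLog helper) built 
elsewhere; a minimal sketch of constructing that client, assuming a plain 
HttpSolrClient pointed at your Solr base URL (the URL here is a placeholder):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Base URL only; the collection ("prindex") is passed per request above.
SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build();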

-Original Message-
From: Srinivas Kashyap 
Sent: Tuesday, 25 August 2020 17:04
To: solr-user@lucene.apache.org
Subject: RE: PDF extraction using Tika

Hi Alexandre,

Yes, these are the same PDF files running on Windows and Linux. There are 
around 30 PDF files and I tried indexing a single file, but got the same error. Is 
it related to how the PDFs are stored on Linux?

And with regard to DIH and Tika going away, can you share a program which 
extracts text from PDFs and pushes it into Solr?

Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way below 
Solr's or possibly even Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I 
would try to narrow down which of the files it is. One way could be to get a 
standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably complain with 
the same error.

Regards,
   Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for 
production. And both will be going away in future Solr versions. You may have a 
much less brittle pipeline if you save the structured outputs from those Tika 
standalone runs and then index them into Solr, possibly pre-processed.
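
A minimal sketch of that standalone check, assuming Tika's AutoDetectParser (the 
same parser Solr embeds) is on the classpath; it walks a directory of PDFs and 
reports which file fails to parse:

import java.io.InputStream;
import java.nio.file.*;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class FindBadPdf {
    public static void main(String[] args) throws Exception {
        // args[0] is the directory holding the PDFs
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]), "*.pdf")) {
            for (Path p : dir) {
                try (InputStream in = Files.newInputStream(p)) {
                    new AutoDetectParser().parse(in, new BodyContentHandler(-1), new Metadata());
                    System.out.println("OK     " + p);
                } catch (Exception e) {
                    System.out.println("FAILED " + p + " : " + e.getMessage());
                }
            }
        }
    }
}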

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:
>
> Hello,
>
> We are using TikaEntityProcessor to extract the content out of PDF and make 
> the content searchable.
>
> When Jetty is run on a Windows-based machine, we are able to successfully load 
> documents using a DIH full import (Tika entity). Here the PDFs are kept on the 
> Windows file system.
>
> But when Solr (Jetty) is run on a Linux machine and we try to run DIH, we 
> get the exception below (here the PDFs are kept on the Linux 
> filesystem):
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> at 
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content 

RE: How do *you* restrict access to Solr?

2020-03-16 Thread Phil Scadden
First off, use basic authentication to at least partially lock it down. Only 
the application server has access to the password. Second, our IT people 
thought Solr security insufficient to even remotely consider exposing it to the 
external web. It lives behind the firewall, so we do a kind of proxy. External 
queries are passed to an internal application server which examines, modifies and 
adds security to queries and then passes them to Solr. Results are sent back up 
the chain to the external application server. I believe variations of this are 
what is expected. Our deconstruct/reconstruct queries are unusual, but it does 
allow us to use rights-based access to functionality, i.e. the general public can 
do searches against the title, author and abstract, while privileged and internal 
users can query against the full text of the technical reports.
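
A rough sketch of that proxy idea, assuming a servlet in front of an internal 
Solr, a role already resolved for the session, and the collection/field names 
from my own setup (prindex, internal) as examples only:

import java.io.IOException;
import javax.servlet.http.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchProxyServlet extends HttpServlet {
    // Solr lives behind the firewall; only this application server can reach it.
    private final HttpSolrClient solr =
            new HttpSolrClient.Builder("http://internal-solr:8983/solr").build();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean privileged = req.isUserInRole("internal");   // however your app resolves rights

        SolrQuery q = new SolrQuery(req.getParameter("q"));
        q.set("defType", "edismax");
        // General public searches title/author/abstract; privileged users also get full text.
        q.set("qf", privileged ? "title author abstract _text_" : "title author abstract");

        QueryRequest solrReq = new QueryRequest(q);
        solrReq.setBasicAuthCredentials("solrGuest", "password");  // placeholder credentials
        try {
            QueryResponse r = solrReq.process(solr, "prindex");
            resp.setContentType("application/json");
            resp.getWriter().write(r.getResults().toString());  // serialize properly in real code
        } catch (Exception e) {
            resp.sendError(502, "search failed");
        }
    }
}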

-Original Message-
From: Ryan W 
Sent: Tuesday, 17 March 2020 03:44
To: solr-user@lucene.apache.org
Subject: How do *you* restrict access to Solr?

How do you, personally, do it?  Do you use IPTables?  Basic Authentication 
Plugin? Something else?

I'm asking in part so I'll have something to search for.  I don't know where I 
should begin, so I figured I would ask how others do it.

I haven't been able to find anything that works, so if you can tell me what 
works for you, I can at least narrow it down a bit and do some Google searches. 
 Do I need to learn Solr's plugin system?  Am I starting in the right place if 
I follow this document:
https://lucene.apache.org/solr/guide/7_0/rule-based-authorization-plugin.html#rule-based-authorization-plugin

Initially, the above document seems far too comprehensive for my needs.  I just 
want to block access to the Solr admin UI, and the list of predefined 
permissions in that document doesn't seem to be relevant.  Also, it seems 
unlikely this plugin system is necessary just to control access to the admin 
UI... or maybe it is necessary?

In any case, what is your approach?

I'm using version 7.7.2 of Solr.

Thanks!


RE: Upgrading tika

2019-03-20 Thread Phil Scadden
While the update/extract handler is fine for testing, Tika is a heavyweight: a bad 
document risks compromising the Solr instance, and even with ordinary docs Tika is 
a resource hog.

I wrote code with SolrJ to do the indexing and run it on a completely different 
machine from the Solr instance. It just sends SolrDocuments (created from analysis 
by Tika) to the server, as Erick says. This becomes even more important if you are 
going to incorporate inline OCR into the Tika processing (the default). The Solr 
docs give you the outline for the SolrJ process. I don't do inline OCR.

My workflow is something like this:
Find a document to add.
If it is an image PDF, convert it to a searchable PDF via OCR, as a searchable PDF 
is a more useful document to deliver as the result of a search.
Submit the document to the SolrJ-based Solr indexer.

The core of my indexer is:
File f = new File(filename);
ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// This special setup of PDF processing is only required to switch OCR off.
if (filename.toLowerCase().contains("pdf")) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);
}
InputStream input = new FileInputStream(f);
try {
    parser.parse(input, textHandler, metadata, context);
} catch (Exception e) {
    // exception handling
}
SolrInputDocument up = new SolrInputDocument();
up.addField("id", f.getCanonicalPath());
// other addField calls for items extracted from metadata etc.
String content = textHandler.toString();
up.addField("_text_", content);
UpdateRequest req = new UpdateRequest();
req.add(up);
req.setBasicAuthCredentials("solrAdmin", password);
UpdateResponse ur = req.process(solr, "myindex");
req.commit(solr, "myindex");

-Original Message-
From: Geoffrey Willis 
Sent: Thursday, 21 March 2019 06:52
To: solr-user@lucene.apache.org
Subject: Re: Upgrading tika

Could you expand on that please? I’m currently building a webApp that submits 
documents to Solr/Tika via the update/extract handler and it’s working fine. 
What do you mean when you say “You do not want to have your Solr instance 
processing via Tika”? If that’s a bad design choice please elaborate.
Thanks,
Geoff


> On Mar 19, 2019, at 5:15 PM, Phil Scadden  wrote:
>
> As per Erick's advice, I would strongly recommend that you do any Tika work in a 
> separate SolrJ programme. You do not want your Solr instance doing the Tika 
> processing.
>
> -Original Message-
> From: Tannen, Lev (USAEO) [Contractor] 
> Sent: Wednesday, 20 March 2019 08:17
> To: solr-user@lucene.apache.org
> Subject: RE: Upgrading tika
>
> Sorry Erick,
> Please disregard my previous message. Somehow I downloaded the version 
> without those two files. I am going to download the latest version solr 8.0.0 
> and try it.
> Best
> Lev Tannen
>
> -Original Message-
> From: Erick Erickson 
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user 
> Subject: Re: Upgrading tika
>
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
>
> Tika is upgraded when new versions come out, so the underlying files are 
> whatever are current at the time.
>
> The integration is a fairly loose coupling, if you're using some external 
> program (say a SolrJ program) to parse the files, there's no requirement to 
> use the jars distributed with Solr, use whatever suits your fancy. An 
> external program just constructs a SolrDocument to send to Solr. What you use 
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
>
> If you're using the ExtractingRequestHandler, where you just send the 
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
> anything about individual Tika-related jar files is kind of strange.
>
> If your predecessors wrote some custom code that runs as part of Solr, I 
> don't know what to say...
>
> Best,
> Erick
>
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
>  wrote:
>>
>> Thank you Shawn.
>> I assumed that Tika has been integrated with Solr. In the project written 
>> before me they used two Tika files taken from the Solr distribution. I am trying

RE: Upgrading tika

2019-03-19 Thread Phil Scadden
As per Erick's advice, I would strongly recommend that you do any Tika work in a 
separate SolrJ programme. You do not want your Solr instance doing the Tika 
processing.

-Original Message-
From: Tannen, Lev (USAEO) [Contractor] 
Sent: Wednesday, 20 March 2019 08:17
To: solr-user@lucene.apache.org
Subject: RE: Upgrading tika

Sorry Erick,
Please disregard my previous message. Somehow I downloaded the version without 
those two files. I am going to download the latest version solr 8.0.0 and try 
it.
Best
Lev Tannen

-Original Message-
From: Erick Erickson 
Sent: Tuesday, March 19, 2019 2:48 PM
To: solr-user 
Subject: Re: Upgrading tika

Yes, Solr is distributed with Tika. Look in:
./solr/contrib/extraction/lib

Tika is upgraded when new versions come out, so the underlying files are 
whatever are current at the time.

The integration is a fairly loose coupling, if you're using some external 
program (say a SolrJ program) to parse the files, there's no requirement to use 
the jars distributed with Solr, use whatever suits your fancy. An external 
program just constructs a SolrDocument to send to Solr. What you use to create 
that document is irrelevant. See:
https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.

If you're using the ExtractingRequestHandler, where you just send the 
semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
anything about individual Tika-related jar files is kind of strange.

If your predecessors wrote some custom code that runs as part of Solr, I don't 
know what to say...

Best,
Erick

On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
 wrote:
>
> Thank you Shawn.
> I assumed that Tika has been integrated with Solr. In the project written 
> before me they used two Tika files taken from the Solr distribution. I am trying 
> to do the same with Solr 7.7.1. However, this version contains a different set 
> of Tika-related files. So I am confused. Does Solr not have integrated 
> Tika anymore, or can I just not recognize the files?
>
> -Original Message-
> From: Shawn Heisey 
> Sent: Tuesday, March 19, 2019 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Upgrading tika
>
> On 3/19/2019 9:03 AM, levtannen wrote:
> > Could anybody suggest what files I need to use the latest
> > version of Tika, and where to find them?
>
> This mailing list is solr-user.  Tika is an entirely separate project from 
> Solr within the Apache Foundation.  To get help with Tika, you'll need to ask 
> that project.
>
> https://tika.apache.org/mail-lists.html
>
> Thanks,
> Shawn


RE: Solr OCR Support

2018-11-04 Thread Phil Scadden
I would strongly consider doing OCR offline, BEFORE loading the documents into Solr. 
The advantage of this is that you convert your image PDF into a searchable PDF. 
Consider someone using Solr who has found a document that matches their search 
criteria. Once they retrieve the document, they will discover it has not been OCRed 
and they cannot use a text search within the document. If the document that you are 
feeding Solr is large, then this is a major pain. Setting up Tesseract (or whatever 
engine - Tesseract involves a bit of a tool chain) to OCR and save as a searchable 
PDF means you can provide a much more useful document as the result of a Solr 
search. Feed that searchable PDF to SolrJ with OCR turned off.

   PDFParserConfig pdfConfig = new PDFParserConfig();
   pdfConfig.setExtractInlineImages(false);
   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
   context.set(PDFParserConfig.class,pdfConfig);
   context.set(Parser.class,parser);
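
For the opposite case, where you do want Tika to drive Tesseract itself, a hedged 
sketch of the ParseContext setup (OCR_AND_TEXT_EXTRACTION and TesseractOCRConfig 
are standard Tika classes, but the language and path values here are placeholders):

import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;

ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();

TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
ocrConfig.setLanguage("eng");                      // placeholder language pack
// ocrConfig.setTesseractPath("/usr/local/bin/");  // only if tesseract is not on the PATH

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);

context.set(TesseractOCRConfig.class, ocrConfig);
context.set(PDFParserConfig.class, pdfConfig);
context.set(Parser.class, parser);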

-Original Message-
From: Furkan KAMACI 
Sent: Saturday, 3 November 2018 03:30
To: solr-user@lucene.apache.org
Subject: Solr OCR Support

Hi All,

I want to index images, and PDF documents which contain images, into Solr. I tested 
it with my Solr 6.3.0.

I've installed Tesseract on my computer (Mac). I verified that Tesseract works 
fine for extracting text from an image.

I indexed an image into Solr but it has no content. However, as far as I know, I 
don't need to do anything else to integrate Tesseract with Solr.

I've checked these but they were not useful for me:

http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html

My question is, how can I support OCR with Solr?

Kind Regards,
Furkan KAMACI


RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
I will second the SolrJ method. You don't want to be doing this on your Solr 
instance. One question is whether your PDFs are scanned or already searchable. I 
use Tesseract offline to convert all scanned PDFs into searchable PDFs, so I don't 
want Tika to be doing that. The core of my code is:
File f = new File(filename);
ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
if (filename.toLowerCase().contains("pdf")) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);
    // Remove this line (in fact remove the whole PDFParserConfig block) if you want Tika to OCR
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);
}
InputStream input = new FileInputStream(f);
try {
    parser.parse(input, textHandler, metadata, context);
} catch (Exception e) {
    e.printStackTrace();
    return false;
}
SolrInputDocument up = new SolrInputDocument();
if (title == null) title = metadata.get("title");
if (author == null) author = metadata.get("author");
up.addField("id", f.getCanonicalPath()); // load up whatever fields you are using
up.addField("location", idString);
up.addField("access", access);
up.addField("datasource", datasource);
up.addField("title", title);
up.addField("author", author);
if (year > 0) up.addField("year", year);
if (opfyear > 0) up.addField("opfyear", opfyear);
String content = textHandler.toString();
up.addField("_text_", content);
UpdateRequest req = new UpdateRequest();
req.add(up);
req.setBasicAuthCredentials("solrAdmin", password);
UpdateResponse ur = req.process(solr, "prindex");
req.commit(solr, "prindex");
return true;

-Original Message-
From: Erick Erickson 
Sent: Wednesday, 31 October 2018 06:00
To: solr-user 
Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA

All of the above work, but for robust production situations you'll want to 
consider a SolrJ client, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines 
indexing from a DB and using Tika, but those are independent.

Best,
Erick
On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau  wrote:
>
> Hi there,
>
> Here are a couple of ways I'm aware of:
>
> 1. Extract-handler / post tool
> You can use the curl command with the extract handler or bin/post to
> upload a single document.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell
> -using-apache-tika.html
>
> 2. DataImportHandler
> This could be used for, say, uploading multiple documents with Tika.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-sto
> re-data-with-the-data-import-handler.html#the-tikaentityprocessor
>
> You should also be able to do it via the admin page, so long as you
> define and modify the extract handler in solrconfig.xml.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-up
> load
>
> Hope this helps!
>
> On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin
> 
> wrote:
>
> > Hello there, let me introduce myself. My name is Mohammad Kevin
> > Putra (you can call me Kevin), from Indonesia. I am a beginner
> > backend developer. I use Linux Mint, Apache Solr 7.5.0 and Apache
> > Tika 1.91.0.
> >
> > I have a bit of a problem with how to put a PDF file in via Apache
> > Tika. I understand how Solr and Tika work, but I don't know how they
> > are integrated with each other.
> > The last thing I know is that Tika can extract the PDF file I upload and
> > parse it into data/metadata automatically, and I just have to copy &
> > paste it into the "Documents" tab of the Solr core.
> > The questions are:
> > 1. Can I upload a PDF file to Solr via Tika in GUI mode, or is it
> > only possible in CLI mode? If it is CLI only, can you explain it
> > to me please?
> > 2. Is it possible to add a text result in the "Query" tab?
> >
> > The background to my asking is that I want to index PDFs on my
> > local system, then just upload them "drag & drop" style into Solr (is
> > that possible?). Then when I type something in the search box the result
> > looks like this:
> > (Title of doc)
> > blablablabla (yellow stabilo result) blablabla.
> > The blablabla text is a couple of sentences. That's all I need.
> > Sorry for my bad english.
> > Thanks for reading and replying, it will be very helpful to me.
> > Thanks a lot
> >

RE: Solr Read-Only?

2018-03-07 Thread Phil Scadden
I would also second the proxy approach. Besides keeping your Solr instance 
behind a firewall and not directly exposed, you can do a lot in a proxy: 
per-user control over which indexes they can access, filtering of queries, etc.

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, 7 March 2018 10:19 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Solr Read-Only?

Hi Terry,
Maybe you can try alternative approaches, like putting a proxy in front of 
Solr and configuring it to allow only certain URLs. Another option is to define a 
custom update request processor chain that does not include 
RunUpdateProcessorFactory - that will prevent accidental index updates.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 6 Mar 2018, at 22:55, Terry Steichen  wrote:
>
> Chris,
>
> Thanks for your suggestion.  Restarting solr after an in-memory
> corruption is, of course, trivial (compared to rebuilding the indexes).
>
> Are there any solr directories that MUST be read/write (even with a
> pre-built index)?  Would it suffice (for my purposes) to make only the
> data/index directory R-O?
>
> Terry
>
>
> On 03/06/2018 04:20 PM, Christopher Schultz wrote:
>> Terry,
>>
>> On 3/6/18 4:08 PM, Terry Steichen wrote:
>>> Is it possible to run solr in a read-only directory?
>>
>>> I'm running it just fine on a ubuntu server which is accessible only
>>> through SSH tunneling.  At the platform level, this is fine:
>>> only authorized users can access it (via a browser on their machine
>>> accessing a forwarded port).
>>
>>> The problem is that it's an all-or-nothing situation so everyone
>>> who's authorized access to the platform has, in effect,
>>> administrator privileges on solr.  I understand that authentication
>>> is coming, but that it isn't here yet.  (Or, to add complexity, I
>>> had to downgrade from 7.2.1 to 6.4.2 to overcome a new bug
>>> concerning indexing of eml files, and 6.4.2 definitely doesn't have
>>> authentication.)
>>
>>> Anyway, what I was wondering is if it might be possible to run solr
>>> not as me (the administrator), but as a user with lesser privileges
>>> so that no one who came through the SSH tunnel could (inadvertently
>>> or otherwise) screw up the indexes.
>>
>> With shell access, the only protection you could provide would be
>> through file-permissions. But of course Solr will need to be
>> read-write in order to build the index in the first place. So you'd
>> probably have to run read-write at first, build the index (perhaps
>> that's already been done in the past), then (possibly) restart in
>> read-only mode.
>>
>> Read-only can be achieved by simply revoking write-access to the data
>> directories from the euid of the Solr process. Theoretically, you
>> could switch from being read-write to read-only merely by changing
>> file-permissions... no Solr restarts required.
>>
>> I'm not sure if it matters to you very much, but a user can still do
>> some damage to the index even if the "server" is read-only (through
>> file-permissions): they can issue a batch of DELETE or ADD requests
>> that will effect the in-memory copies of the index. It might be
>> temporary, but it might require that you restart the Solr instance to
>> get back to a sane state.
>>
>> Hope that helps,
>> -chris
>>
>



RE: Turn on/off query based on a url parameter

2018-02-22 Thread Phil Scadden
I always filter Solr requests via a proxy (so Solr itself is not exposed directly 
to the web). In that proxy, the query parameters can be broken down and filtered 
as desired (I examine the authorities granted to a session to control even which 
indexes are being searched) before passing the modified URL to Solr. The coding of 
the proxy obviously depends on your application environment. We use Java and 
Spring.
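
A minimal illustration of that filtering step, assuming the session's authorities 
have already been resolved (the authority name, field and core names below are 
just examples, not anything Solr defines):

import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;

// Rewrite an incoming query according to the authorities granted to the session.
SolrQuery rewrite(String userQuery, Set<String> authorities) {
    SolrQuery q = new SolrQuery(userQuery);
    boolean internal = authorities.contains("ROLE_INTERNAL");
    // Restrict non-internal users to publicly accessible documents.
    q.addFilterQuery(internal ? "access:(public OR internal)" : "access:public");
    // Which core to search can be chosen the same way, e.g.
    // String core = internal ? "prindex" : "publicindex"; then query(core, q).
    return q;
}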

-Original Message-
From: Roopa Rao [mailto:roop...@gmail.com]
Sent: Friday, 23 February 2018 8:04 a.m.
To: solr-user@lucene.apache.org
Subject: Turn on/off query based on a url parameter

Hi,

I want to enable or disable a SolrFeature in LTR based on an efi parameter.

In simple terms, the query should be executed only if a parameter is true.

Any examples or suggestions on how to accomplish this?

The function query examples I have found use fields to derive a value. In my case 
I want to execute the query only if a URL parameter is true.

Thanks,
Roopa


RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
Well, I have a lot of OCRed PDFs, but the extremely slow text extraction is hard to 
pin down. The bulk of the OCRed ones aren't too slow, but then I have one that 
takes several minutes.  I use a little utility, pdftotext.exe, to make a crude 
guess at whether OCR is necessary, and it is much faster (but not that easy to use 
in the indexing workflow). Some of the big modern ones (fully digital) can also be 
very slow. Maybe the amount of inline imagery? It doesn't seem to bother 
pdftotext.
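
A hedged sketch of the kind of crude check I mean, assuming the Poppler/Xpdf 
pdftotext binary is on the PATH; if the first few pages yield almost no text, the 
file probably needs OCR (the threshold is a guess, not anything pdftotext defines):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Path;

// Crude test: extract the first few pages with pdftotext and see how much text comes back.
static boolean probablyNeedsOcr(Path pdf) throws Exception {
    Process p = new ProcessBuilder("pdftotext", "-f", "1", "-l", "3", pdf.toString(), "-")
            .redirectErrorStream(true)
            .start();
    long chars = 0;
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        String line;
        while ((line = r.readLine()) != null) {
            chars += line.trim().length();
        }
    }
    p.waitFor();
    return chars < 200;   // tune for your documents
}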

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, 8 December 2017 3:36 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Alternatives to tika for extracting text out of PDFs

No need to prove it. More modern PDF formats are easier to decode, but for many 
years the text was move-print-move-print, so the font metrics were necessary to 
guess at spaces.  Plus, the glyph IDs had to be mapped to characters, so some 
PDFs were effectively a substitution code. Our team joked about using cow 
(crypt breakers workbench) for PDF decoding, but decided it would be a problem 
for export.

I saw one two-column PDF where the glyphs were laid out strictly top to bottom, 
across both columns. Whee!

A friend observed that turning a PDF into a structured document is like turning 
hamburger back into a cow. The PDF standard has improved a lot, but then you 
get an OCR’ed PDF.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 7, 2017, at 5:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> I'm going to guess it's the exact opposite. The meta-data is the "semi
> structured" part which is much easier to collect than the PDF. I mean
> there are parameters to tweak that consider how much space between
> letters in words (in the body text) should be allowed and still
> consider it a single word. I'm not quite sure how to prove that, but
> I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> I am indexing PDFs and a separate process has converted any image PDFs to 
>> search PDF before solr gets near it. I notice that tika is very slow at 
>> parsing some PDFs. I don't need any metadata (which I suspect is slowing 
>> tika down), just the text. Has anyone used an alternative PDF text 
>> extraction library in a SOLRJ context?



Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
I am indexing PDFs, and a separate process has converted any image PDFs to 
searchable PDFs before Solr gets near them. I notice that Tika is very slow at 
parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika 
down), just the text. Has anyone used an alternative PDF text extraction library 
in a SolrJ context?


RE: Multiple cores versus a "source" field.

2017-12-05 Thread Phil Scadden
Thanks Walter. Your case does apply as both data stores do indeed cover the 
same kind of material, with many important terms in common. "source" + fq: 
coming up.
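
A minimal sketch of the plan, assuming a single core with a "source" field 
distinguishing the two stores (the value names are just examples):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("drilling reports");      // the user's query
// Restrict to one document store; filter queries don't affect scoring and are cached.
q.addFilterQuery("source:technical_reports");         // example source value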

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Tuesday, 5 December 2017 5:51 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Multiple cores versus a "source" field.

One more opinion on source field vs separate collections for multiple corpora.

Index statistics don’t really settle down until at least 100k documents. Below 
that, idf is pretty noisy. With Ultraseek, we used pre-calculated frequency 
data for collections under 10k docs.

If your corpora have similar word statistics, you might get more predictable 
relevance with a single collection. For example, if you have data sheets and 
press releases, but they are both about test instruments, then you might get 
some advantage from having more data points about the “text” and “title” fields.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 4, 2017, at 7:17 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> Thanks Eric. I have already followed the solrj indexing very closely - I have 
> to do a lot of manipulation at indexing time. The other blog article is very 
> interesting as I do indeed use "year" (year of publication) and it is very 
> frequently used to filter queries. I will have a play with that now.
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, 5 December 2017 4:11 p.m.
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Multiple cores versus a "source" field.
>
> That's the unpleasant part of semi-structued documents (PDF, Word, whatever). 
> You never know the relationship between raw size and indexable text.
>
> Basically anything that you don't care to contribute to _scoring_ is often 
> better in an fq clause. You can also use {!cache=false} to bypass actually 
> using the cache if you know it's unlikely to be reused.
>
> Two other points:
>
> 1> you can offload the parsing to clients rather than Solr and gain
> more control over the process (assuming you haven't already). Here's a blog:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> 2> One reason to not go to fq clauses (except if you use
> {!cache=false}) is if you are using bare NOW in your clauses for, say ranges, 
> one common construct is fq=date[NOW-1DAY TO NOW]. Here's another blog on the 
> subject:
> https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
>
>
> Best,
> Erick
>
>
> On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>> You'll have a few economies of scale I think with a single core, but 
>>> frankly I don't know if they'd be enough to measure. You say the docs are 
>>> "quite large" though, >are you talking books? Magazine articles? is 20K 
>>> large or are the 20M?
>>
>> Technical reports. Sometimes up to 200MB pdfs, but that would include a lot 
>> of imagery. More typically 20Mb. A 140MB pdf contained only 400k of text.
>>
>> Thanks for tip on fq: I will put that into code now as I have other fields 
>> used is similar fashion.



RE: Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
Thanks Erick. I have already followed the SolrJ indexing blog very closely - I have 
to do a lot of manipulation at indexing time. The other blog article is very 
interesting, as I do indeed use "year" (year of publication) and it is very 
frequently used to filter queries. I will have a play with that now.
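
A sketch of what "playing with that" might look like, following the two points in 
Erick's mail below (non-scoring clauses as filter queries, {!cache=false} for 
one-off filters, and rounded date math so a date fq stays cacheable); the field 
names other than "year" are just examples:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("aquifer");
// Year of publication doesn't need to influence scoring, so it goes in an fq.
q.addFilterQuery("year:[1990 TO 2010]");
// A one-off filter unlikely to be reused can skip the filter cache.
q.addFilterQuery("{!cache=false}datasource:fieldnotes");
// If filtering on a timestamp, round NOW so the fq is reusable across requests.
q.addFilterQuery("indexed_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]");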

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, 5 December 2017 4:11 p.m.
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Multiple cores versus a "source" field.

That's the unpleasant part of semi-structued documents (PDF, Word, whatever). 
You never know the relationship between raw size and indexable text.

Basically anything that you don't care to contribute to _scoring_ is often 
better in an fq clause. You can also use {!cache=false} to bypass actually 
using the cache if you know it's unlikely to be reused.

Two other points:

1> you can offload the parsing to clients rather than Solr and gain
more control over the process (assuming you haven't already). Here's a blog:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> One reason to not go to fq clauses (except if you use
{!cache=false}) is if you are using bare NOW in your clauses for, say ranges, 
one common construct is fq=date[NOW-1DAY TO NOW]. Here's another blog on the 
subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/


Best,
Erick


On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>You'll have a few economies of scale I think with a single core, but frankly 
>>I don't know if they'd be enough to measure. You say the docs are "quite 
>>large" though, >are you talking books? Magazine articles? is 20K large or are 
>>the 20M?
>
> Technical reports. Sometimes up to 200MB pdfs, but that would include a lot 
> of imagery. More typically 20Mb. A 140MB pdf contained only 400k of text.
>
> Thanks for tip on fq: I will put that into code now as I have other fields 
> used is similar fashion.


RE: Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
>You'll have a few economies of scale I think with a single core, but frankly I 
>don't know if they'd be enough to measure. You say the docs are "quite large" 
>though, >are you talking books? Magazine articles? is 20K large or are the 20M?

Technical reports. Sometimes up to 200MB PDFs, but that would include a lot of 
imagery. More typically 20MB. A 140MB PDF contained only 400KB of text.

Thanks for the tip on fq: I will put that into code now, as I have other fields 
used in a similar fashion.


Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
I have two different document stores that I want to index. Both are quite small 
(<50,000 documents, though documents can be quite large). They are quite capable 
of using the same schema, but you would not want to search both simultaneously. 
I can see two approaches to handling this case.
1/ Create a "source" field and use that to identify which store is being searched. 
The search interface adds the appropriate " AND source=" to queries.
2/ Create a separate core for each.

If you want to use the same Solr server to handle queries to both stores, which 
is the best approach in terms of minimizing JVM size while keeping searches 
reasonably fast?


RE: adding documents to a secured solr server.

2017-11-02 Thread Phil Scadden
Yes, that worked.

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, 2 November 2017 6:14 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.

On 11/1/2017 10:04 PM, Phil Scadden wrote:
> For testing, I changed to HttpSolrClient and specifying the core on process 
> and commit instead of opening it as server/core. This time worked... sort of. 
> Despite deleting the entire index with deletebyquery and seeing that it was 
> empty in the coreAdmin, I get :
>
> possible analysis error: cannot change DocValues type from SORTED_SET to 
> NUMERIC for field "access"
>
> I tried deleting the field in the admin interface and then adding it back in 
> again in that admin interface. But, no. Still comes up with that error. I 
> know deleting the index files on disk works but I don’t have access to the 
> server. This is a frustrating problem.

Variations of this error happen when settings on a field with docValues="true" 
are changed, and the index already has documents added with the previous 
settings.

Each Lucene segment stores information about what kind of docValues are present 
for each field that has docValues, and if you change an aspect of the field 
(multivalued, field class, etc) and try to add a new document with that 
different information, Lucene will complain.  The reason that deleting all 
documents didn't work is that when you delete documents, they are only MARKED 
as deleted, the segments (and deleted
docs) remain on the disk.

The only SURE way to fix it is to completely delete the index directory (or 
directories), reload the core/collection (or restart Solr), and reindex from 
scratch.  One thing you *might* be able to do if you don't have access to the 
server is delete all documents and then optimize the index, which should delete 
all segments and effectively leave you with a brand new empty index.  I'm not 
100% sure that this would take care of it, but I *think* it would.
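
A minimal SolrJ sketch of that delete-then-optimize idea (collection name and 
credentials are placeholders); as noted, whether it fully clears the old segments 
is not guaranteed:

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

// Delete everything, commit, then optimize so old segments (and their docValues) are dropped.
UpdateRequest del = new UpdateRequest();
del.setBasicAuthCredentials("solrAdmin", password);
del.deleteByQuery("*:*");
del.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
del.process(solr, "prindex");

UpdateRequest opt = new UpdateRequest();
opt.setBasicAuthCredentials("solrAdmin", password);
opt.setAction(AbstractUpdateRequest.ACTION.OPTIMIZE, true, true);
opt.process(solr, "prindex");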

Thanks,
Shawn


RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
Requested a reload, and now it indexes against the secured server using 
HttpSolrClient. Phew. I will now look at whether I can optimize and get 
ConcurrentUpdateSolrClient to work.
At least I can get the index back now.

-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2 November 2017 5:04 p.m.
To: solr-user@lucene.apache.org
Subject: RE: adding documents to a secured solr server.

For testing, I changed to HttpSolrClient, specifying the core on process and 
commit instead of opening the client as server/core. This time it worked... sort of. 
Despite deleting the entire index with deleteByQuery and seeing that it was 
empty in the core admin, I get:

possible analysis error: cannot change DocValues type from SORTED_SET to 
NUMERIC for field "access"

I tried deleting the field in the admin interface and then adding it back in 
again in that admin interface. But, no, it still comes up with that error. I know 
deleting the index files on disk works, but I don't have access to the server. 
This is a frustrating problem.



-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:55 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.

On 11/1/2017 8:13 PM, Phil Scadden wrote:
> 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner:
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e
> eba4a
> 14:52:46,224  WARN ConcurrentUpdateSolrClient:343 - Failed to parse
> error response from http://online-dev.gns.cri.nz:8983/solr/prindex due
> to: java.lang.RuntimeException: Invalid version (expected 2, but 60)
> or the data in not in 'javabin' format



> Even more puzzling. Authentication is set. What is the invalid version bit?? 
> I think my solrj is 6.4.1; the server is 6.6.2. Do these have  to match 
> exactly??

The only time I would be worried about different SolrJ and Solr versions is 
when using the CloudSolrClient object.  For the other client types, you can 
usually have a VERY wide version spread without problems.  For the cloud 
object, you *might* have problems with different versions, or it might work 
fine.  If the SolrJ version is higher than the Solr version, the cloud client 
tends to work.

I would always recommend that the client version be the same or higher than the 
server version... but with non-cloud clients, it won't matter very much.  I 
would not expect problems with the two versions you have, as long as you don't 
try to use the cloud client.

This error is different.  It's happening because SolrJ is expecting a Javabin 
response, but it is getting an HTML error response instead, with the "require 
authentication" error.  This logging message will happen anytime SolrJ gets an 
error response instead of a "real" response.  What this error says is 
technically correct, but very confusing to novice users.

The specific numbers in the message are a result of the first character of the 
response.  With javabin, the first character would be 0x02 to indicate the 
javabin version of 2, but with HTML, the first character is the opening angle 
bracket, or character number 60 (0x3C).  This is where those two numbers come 
from.

Thanks,
Shawn


RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
For testing, I changed to HttpSolrClient, specifying the core on process and 
commit instead of opening the client as server/core. This time it worked... sort of. 
Despite deleting the entire index with deleteByQuery and seeing that it was 
empty in the core admin, I get:

possible analysis error: cannot change DocValues type from SORTED_SET to 
NUMERIC for field "access"

I tried deleting the field in the admin interface and then adding it back in 
again in that admin interface. But, no, it still comes up with that error. I know 
deleting the index files on disk works, but I don't have access to the server. 
This is a frustrating problem.



-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:55 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.

On 11/1/2017 8:13 PM, Phil Scadden wrote:
> 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner:
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e
> eba4a
> 14:52:46,224  WARN ConcurrentUpdateSolrClient:343 - Failed to parse
> error response from http://online-dev.gns.cri.nz:8983/solr/prindex due
> to: java.lang.RuntimeException: Invalid version (expected 2, but 60)
> or the data in not in 'javabin' format



> Even more puzzling. Authentication is set. What is the invalid version bit?? 
> I think my solrj is 6.4.1; the server is 6.6.2. Do these have  to match 
> exactly??

The only time I would be worried about different SolrJ and Solr versions is 
when using the CloudSolrClient object.  For the other client types, you can 
usually have a VERY wide version spread without problems.  For the cloud 
object, you *might* have problems with different versions, or it might work 
fine.  If the SolrJ version is higher than the Solr version, the cloud client 
tends to work.

I would always recommend that the client version be the same or higher than the 
server version... but with non-cloud clients, it won't matter very much.  I 
would not expect problems with the two versions you have, as long as you don't 
try to use the cloud client.

This error is different.  It's happening because SolrJ is expecting a Javabin 
response, but it is getting an HTML error response instead, with the "require 
authentication" error.  This logging message will happen anytime SolrJ gets an 
error response instead of a "real" response.  What this error says is 
technically correct, but very confusing to novice users.

The specific numbers in the message are a result of the first character of the 
response.  With javabin, the first character would be 0x02 to indicate the 
javabin version of 2, but with HTML, the first character is the opening angle 
bracket, or character number 60 (0x3C).  This is where those two numbers come 
from.

Thanks,
Shawn


RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
So the real error is authentication (the version message is spurious), but why 
does that happen when authentication is being set on the UpdateRequest?

-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:55 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.

On 11/1/2017 8:13 PM, Phil Scadden wrote:
> 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner:
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e
> eba4a
> 14:52:46,224  WARN ConcurrentUpdateSolrClient:343 - Failed to parse
> error response from http://online-dev.gns.cri.nz:8983/solr/prindex due
> to: java.lang.RuntimeException: Invalid version (expected 2, but 60)
> or the data in not in 'javabin' format



> Even more puzzling. Authentication is set. What is the invalid version bit?? 
> I think my solrj is 6.4.1; the server is 6.6.2. Do these have  to match 
> exactly??

The only time I would be worried about different SolrJ and Solr versions is 
when using the CloudSolrClient object.  For the other client types, you can 
usually have a VERY wide version spread without problems.  For the cloud 
object, you *might* have problems with different versions, or it might work 
fine.  If the SolrJ version is higher than the Solr version, the cloud client 
tends to work.

I would always recommend that the client version be the same or higher than the 
server version... but with non-cloud clients, it won't matter very much.  I 
would not expect problems with the two versions you have, as long as you don't 
try to use the cloud client.

This error is different.  It's happening because SolrJ is expecting a Javabin 
response, but it is getting an HTML error response instead, with the "require 
authentication" error.  This logging message will happen anytime SolrJ gets an 
error response instead of a "real" response.  What this error says is 
technically correct, but very confusing to novice users.

The specific numbers in the message are a result of the first character of the 
response.  With javabin, the first character would be 0x02 to indicate the 
javabin version of 2, but with HTML, the first character is the opening angle 
bracket, or character number 60 (0x3C).  This is where those two numbers come 
from.

Thanks,
Shawn


RE: Stateless queries to secured SOLR server.

2017-11-01 Thread Phil Scadden
Thanks for that Shawn. What I am doing is working fine now. I need the middle 
proxy to audit and modify what the client sends to Solr (based on user rights), 
not to mention keeping Solr from direct exposure to the internet.

-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:13 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.

On 11/1/2017 4:22 PM, Phil Scadden wrote:
> Except that I am using SolrJ in an intermediary proxy and passing the
> response directly to a javascript client. It expects json or csv
> depending on what it passes in wt=

That's a different use case than I had imagined.  Thanks for the detail.

My statement about SolrJ is correct if the code that will handle the response 
is Java.  Sounds like it's not -- you've just said that the code that will 
actually decode and use the response is javascript.

When the code that will handle the response is Java, SolrJ is a perfect fit, 
because SolrJ will handle decoding the response and the programmer doesn't need 
to worry about the format, they are given an object that contains the full 
response, where information can easily be extracted by someone familiar with 
typical Java objects.

There probably is a way to access the full response "text" with SolrJ, rather 
than the decoded object, but I do not know enough about the low-level details 
to tell you how that might be accomplished.  If you can figure that part out, 
then you could use SolrJ and have access to its methods for constructing the 
query.
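
For what it's worth, I believe one way to get at the raw response from SolrJ is 
NoOpResponseParser, which hands back the response body as a string instead of 
decoding it; a hedged sketch (client, core name and credentials as used earlier 
in this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.NoOpResponseParser;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

SolrQuery q = new SolrQuery("*:*");
QueryRequest req = new QueryRequest(q);
NoOpResponseParser rawParser = new NoOpResponseParser();
rawParser.setWriterType("json");                        // ask Solr for wt=json, keep it undecoded
req.setResponseParser(rawParser);
req.setBasicAuthCredentials("solrGuest", password);
NamedList<Object> rsp = solr.request(req, "prindex");
String rawJson = (String) rsp.get("response");          // pass straight to the javascript client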

With your Java code simply acting as a proxy, the way you're going about it 
might be the best option -- build the http request with a particular wt 
parameter, get the response, and pass the response on unmodified.

Thanks,
Shawn


RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
And my security.json looks like:
{
  "authentication":{
"class":"solr.BasicAuthPlugin",
"blockUnknown":true,
"credentials":{
  "solrAdmin":" a hash ",
  "solrGuest":"another hash"},
"":{"v":0}},
  "authorization":{
"class":"solr.RuleBasedAuthorizationPlugin",
"permissions":[
  {
"name":"all",
"role":"admin"},
  {
"name":"read",
"role":"guest"}],
"user-role":{"solrAdmin":["admin","guest"],"solrGuest":"guest"}}}

It looks like I should be able to add.

this one worked to delete the entire index:
   UpdateRequest up = new UpdateRequest();
   up.setBasicAuthCredentials("solrAdmin",password);
   up.deleteByQuery("*:*");
   up.setCommitWithin(1000);
   up.process(solr);

-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2 November 2017 2:59 p.m.
To: solr-user@lucene.apache.org
Subject: RE: adding documents to a secured solr server.

After some digging, I tried this approach...
   solr = new ConcurrentUpdateSolrClient.Builder(solrUrl)
   .withQueueSize(20)
   .build();
 SolrInputDocument up = new SolrInputDocument();
 up.addField("id",f.getCanonicalPath());
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 UpdateRequest req = new UpdateRequest();
 req.setCommitWithin(1000);
 req.add(up);
 req.setBasicAuthCredentials("solrAdmin", password);
 UpdateResponse ur =  req.process(solr);

However,  I get error back of:
14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner: 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a
14:52:46,224  WARN ConcurrentUpdateSolrClient:343 - Failed to parse error 
response from http://online-dev.gns.cri.nz:8983/solr/prindex due to: 
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in 
not in 'javabin' format
14:52:46,224 ERROR ConcurrentUpdateSolrClient:540 - error
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://online-dev.gns.cri.nz:8983/solr/prindex: require 
authentication

request: 
http://online-dev.gns.cri.nz:8983/solr/prindex/update?wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:345)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:184)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14:52:46,224 DEBUG ConcurrentUpdateSolrClient:210 - finished: 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a

Even more puzzling. Authentication is set. What is the invalid version bit?? I 
think my solrj is 6.4.1; the server is 6.6.2. Do these have  to match exactly??

-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2 November 2017 11:28 a.m.
To: solr-user@lucene.apache.org
Subject: adding documents to a secured solr server.

Solrj QueryRequest object has a method to set basic authorization 
username/password but what is the equivalent way to pass authorization when you 
are adding new documents to an index?
   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);
...
 up.addField("id","myid");
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
 solr.commit();

I can't see where authorization occurs?

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
After some digging, I tried this approach...
   solr = new ConcurrentUpdateSolrClient.Builder(solrUrl)
   .withQueueSize(20)
   .build();
 SolrInputDocument up = new SolrInputDocument();
 up.addField("id",f.getCanonicalPath());
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 UpdateRequest req = new UpdateRequest();
 req.setCommitWithin(1000);
 req.add(up);
 req.setBasicAuthCredentials("solrAdmin", password);
 UpdateResponse ur =  req.process(solr);

However, the error I get back is:
14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner: 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a
14:52:46,224  WARN ConcurrentUpdateSolrClient:343 - Failed to parse error 
response from http://online-dev.gns.cri.nz:8983/solr/prindex due to: 
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in 
not in 'javabin' format
14:52:46,224 ERROR ConcurrentUpdateSolrClient:540 - error
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://online-dev.gns.cri.nz:8983/solr/prindex: require 
authentication

request: 
http://online-dev.gns.cri.nz:8983/solr/prindex/update?wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:345)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:184)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14:52:46,224 DEBUG ConcurrentUpdateSolrClient:210 - finished: 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a

Even more puzzling. Authentication is set. What is the invalid version bit? I 
think my SolrJ is 6.4.1; the server is 6.6.2. Do these have to match exactly?

-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2 November 2017 11:28 a.m.
To: solr-user@lucene.apache.org
Subject: adding documents to a secured solr server.

Solrj QueryRequest object has a method to set basic authorization 
username/password but what is the equivalent way to pass authorization when you 
are adding new documents to an index?
   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);
...
 up.addField("id","myid");
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
 solr.commit();

I can't see where authorization occurs?

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
Solrj QueryRequest object has a method to set basic authorization 
username/password but what is the equivalent way to pass authorization when you 
are adding new documents to an index?
   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);
...
 up.addField("id","myid");
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
 solr.commit();

I can't see where authorization occurs?

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Stateless queries to secured SOLR server.

2017-11-01 Thread Phil Scadden
Except that I am using SolrJ in an intermediary proxy and passing the response 
directly to a JavaScript client. It expects json or csv depending on what it 
passes in wt=.
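For that kind of pass-through the trick is to tell SolrJ not to parse the body at all; a 
rough sketch (parameter plumbing simplified, variable names are placeholders):

   SolrQuery q = new SolrQuery(userQuery);
   QueryRequest req = new QueryRequest(q);
   req.setBasicAuthCredentials("solrGuest", password);
   NoOpResponseParser rawParser = new NoOpResponseParser();
   rawParser.setWriterType(wt);                        // "json" or "csv", forwarded as wt=
   req.setResponseParser(rawParser);
   NamedList<Object> result = solr.request(req, "prindex");
   String rawBody = (String) result.get("response");   // hand this straight back to the browser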

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, 2 November 2017 2:48 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.

On 10/31/2017 2:08 PM, Phil Scadden wrote:
> Thanks Shawn. I have done it with SolrJ. Apart from needing the 
> NoopResponseParser to handle the wt=, it was pretty painless.

This is confusing to me, because with SolrJ, you do not need to be concerned 
with the response format *AT ALL*.  You don't need to use the wt parameter, 
SolrJ will handle that for you.  In fact, you should NOT set the wt parameter.

Thanks,
Shawn
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Stateless queries to secured SOLR server.

2017-10-31 Thread Phil Scadden
Thanks Shawn. I have done it with SolrJ. Apart from needing the 
NoOpResponseParser to handle the wt=, it was pretty painless.

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, 1 November 2017 2:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.

On 10/29/2017 6:13 PM, Phil Scadden wrote:
> While SOLR is behind a firewall, I want to now move to a secured SOLR 
> environment. I had been hoping to keep SOLRJ out of the picture and just 
> using httpURLConnection. However, I also don't want to maintain session 
> state, preferring to send authentication with every request. Is this possible 
> with basic Authorization?

I do not know a lot about the authentication in Solr, but I do know that it's 
typically using HTTP basic authentication.  As I understand it, for this kind 
of authentication, every request will require the credentials.

I am not aware of any state/session capability where Solr's HTTP API is 
concerned.  As far as I know, the closest Solr comes to this is that certain 
things, particularly the Collections API, are async capable, where you start a 
process with one HTTP call and then you can make further requests to check 
whether it's done.
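If you do stay with HttpURLConnection, it just comes down to attaching an Authorization 
header to every call; a minimal sketch (URL and credentials are placeholders):

   URL url = new URL("http://localhost:8983/solr/prindex/select?q=*:*&wt=json");
   HttpURLConnection conn = (HttpURLConnection) url.openConnection();
   String token = Base64.getEncoder()
       .encodeToString("solrGuest:password".getBytes(StandardCharsets.UTF_8));
   conn.setRequestProperty("Authorization", "Basic " + token);
   try (BufferedReader in = new BufferedReader(
           new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
     StringBuilder body = new StringBuilder();
     String line;
     while ((line = in.readLine()) != null) {
       body.append(line).append('\n');
     }
     // body now holds the response; no cookie or session state is involved
   }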

If your software is written in Java, I would strongly recommend SolrJ, rather 
than constructing the HTTP calls yourself.  The code is easier to write and 
understand.  For other languages, there are third-party Solr client libraries 
available.

Thanks,
Shawn

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Stateless queries to secured SOLR server.

2017-10-29 Thread Phil Scadden
While SOLR is behind a firewall, I want to now move to a secured SOLR 
environment. I had been hoping to keep SOLRJ out of the picture and just using 
httpURLConnection. However, I also don't want to maintain session state, 
preferring to send authentication with every request. Is this possible with 
basic Authorization?
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


solr and machine learning - recommendations?

2017-10-05 Thread Phil Scadden
Now that I have a big hunk of documents indexed with Solr, I am looking to 
see whether I can try some machine learning tools to try to extract 
bibliographic references out of the documents. Anyone got some recommendations 
about which kits might be good to play with for something like this?
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: DocValues, Long and SolrJ

2017-09-26 Thread Phil Scadden
The delete before the additions is done with:

   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);
   try {
solr.deleteByQuery("*:*");
solr.commit();
   } catch (SolrServerException | IOException ex) {

   }

// start the index rebuild

-Original Message-----
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Wednesday, 27 September 2017 10:04 a.m.
To: solr-user@lucene.apache.org
Subject: RE: DocValues, Long and SolrJ

I get it after I have deleted the index with a delete query and start trying to 
populate it again with new documents. The error occurs when the indexer tries 
to add a new document. And yes, I did change the schema before I started the 
operation.

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
Sent: Tuesday, 26 September 2017 8:49 p.m.
To: solr-user@lucene.apache.org
Subject: Re: DocValues, Long and SolrJ

Hi Phil,
Are you saying that you get this error when you create fresh core/collection? 
This sort of errors are usually related to schema being changed after some 
documents being indexed.

Thanks,
Emir

> On 25 Sep 2017, at 23:42, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> I ran into a problem with indexing documents which I worked around by 
> changing data type, but I am curious as to how the setup could be made to 
> work.
>
> Solr 6.5.1 - Field type Long, multivalued false, DocValues.
>
> In indexing with Solr, I set the value of field with:
>Long accessLevel
>...
>accessLevel = qury.val(1);
>...
>Document.addField("access", accessLevel);
>
> Solr fails to add the document with this message:
>
> "cannot change DocValues type from SORTED_SET to NUMERIC for field"
>
> ??? So how do you configure a single-valued Long type?
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: DocValues, Long and SolrJ

2017-09-26 Thread Phil Scadden
I get it after I have deleted the index with a delete query and start trying to 
populate it again with new documents. The error occurs when the indexer tries 
to add a new document. And yes, I did change the schema before I started the 
operation.

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
Sent: Tuesday, 26 September 2017 8:49 p.m.
To: solr-user@lucene.apache.org
Subject: Re: DocValues, Long and SolrJ

Hi Phil,
Are you saying that you get this error when you create fresh core/collection? 
This sort of errors are usually related to schema being changed after some 
documents being indexed.

Thanks,
Emir

> On 25 Sep 2017, at 23:42, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> I ran into a problem with indexing documents which I worked around by 
> changing data type, but I am curious as to how the setup could be made to 
> work.
>
> Solr 6.5.1 - Field type Long, multivalued false, DocValues.
>
> In indexing with Solr, I set the value of field with:
>Long accessLevel
>...
>accessLevel = qury.val(1);
>...
>Document.addField("access", accessLevel);
>
> Solr fails to add the document with this message:
>
> "cannot change DocValues type from SORTED_SET to NUMERIC for field"
>
> ??? So how do you configure a single-valued Long type?
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


DocValues, Long and SolrJ

2017-09-25 Thread Phil Scadden
I ran into a problem with indexing documents which I worked around by changing 
data type, but I am curious as to how the setup could be made to work.

Solr 6.5.1 - Field type Long, multivalued false, DocValues.

In indexing with Solr, I set the value of field with:
Long accessLevel
...
accessLevel = qury.val(1);
...
Document.addField("access", accessLevel);

Solr fails to add the document with this message:

"cannot change DocValues type from SORTED_SET to NUMERIC for field"

??? So how do you configure a single-valued Long type?
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Solr update failing on remote server but works locally??

2017-09-24 Thread Phil Scadden
Finally got it. A version difference between the local and remote versions of Solr 
meant different defaults on some fields. Field type Long gets DocValues even if 
you leave it unchecked, and it also changed to multiValued=false. This results in 
"cannot change DocValues type from SORTED_SET to NUMERIC for field". Beats me 
what it expects for values in document.addField(...), but changing the field 
type from Long to Int fixed it.
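For the record, spelling the field out in the schema rather than relying on 
version-dependent defaults avoids the mismatch; a hedged example for a 6.x schema, using 
the access field from the indexer code:

   <field name="access" type="long" indexed="true" stored="true"
          multiValued="false" docValues="true"/>

Note the "cannot change DocValues type" error can persist even after a deleteByQuery, 
because the old SORTED_SET type is still recorded in the existing index; recreating the 
core (or deleting its data/index directory) before reindexing is the safer path.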

-Original Message-----
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Sunday, 24 September 2017 4:35 p.m.
To: solr-user@lucene.apache.org
Subject: Solr update failing on remote server but works locally??

I attempted to redo an index job. The delete query worked fine, but on 
reindex I get this:
09:42:51,061 ERROR ConcurrentUpdateSolrClient:463 - error
org.apache.solr.common.SolrException: Bad Request



request: http://online-uat:8983/solr/prindex/update?wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:290)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:161)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The same code works fine when the Solr instance is local on my test machine. 
The remote machine is reachable. The prindex core seems fine. A delete query to 
remove the old index executed fine. But the update fails. I am not sure where to 
begin looking. "sendUpdateStream" would indicate the client is trying to send the 
document, but somehow the request format is wrong.
http://online-uat:8983/solr/prindex/select?q=*:*
just returns:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

Which is consistent with deleting everything but having the syntax correct?

The updating code is:
   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);



File f = new File(filename);
 ContentHandler textHandler = new 
BodyContentHandler(Integer.MAX_VALUE);
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ParseContext context = new ParseContext();
 if (filename.toLowerCase().contains("pdf")) {
   PDFParserConfig pdfConfig = new PDFParserConfig();
   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
   context.set(PDFParserConfig.class,pdfConfig);
   context.set(Parser.class,parser);
 }
 InputStream input = new FileInputStream(f);
 try {
   parser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
   e.printStackTrace();
   return false;
  }
 SolrInputDocument up = new SolrInputDocument();
 if (title==null) title = metadata.get("title");
 if (author==null) author = metadata.get("author");
 up.addField("id",f.getCanonicalPath());
 up.addField("location",idString);
 up.addField("access",access);
 up.addField("datasource",datasource);
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Solr update failing on remote server but works locally??

2017-09-23 Thread Phil Scadden
I attempted to redo an index job. The delete query worked fine, but on 
reindex I get this:
09:42:51,061 ERROR ConcurrentUpdateSolrClient:463 - error
org.apache.solr.common.SolrException: Bad Request



request: http://online-uat:8983/solr/prindex/update?wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:290)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:161)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The same code works fine when the Solr instance is local on my test machine. 
The remote machine is reachable. The prindex core seems fine. A delete query to 
remove the old index executed fine. But the update fails. I am not sure where to 
begin looking. "sendUpdateStream" would indicate the client is trying to send the 
document, but somehow the request format is wrong.
http://online-uat:8983/solr/prindex/select?q=*:*
just returns:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

Which is consistent with deleting everything but having the syntax correct?

The updating code is:
   ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2);



File f = new File(filename);
 ContentHandler textHandler = new 
BodyContentHandler(Integer.MAX_VALUE);
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ParseContext context = new ParseContext();
 if (filename.toLowerCase().contains("pdf")) {
   PDFParserConfig pdfConfig = new PDFParserConfig();
   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
   context.set(PDFParserConfig.class,pdfConfig);
   context.set(Parser.class,parser);
 }
 InputStream input = new FileInputStream(f);
 try {
   parser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
   e.printStackTrace();
   return false;
  }
 SolrInputDocument up = new SolrInputDocument();
 if (title==null) title = metadata.get("title");
 if (author==null) author = metadata.get("author");
 up.addField("id",f.getCanonicalPath());
 up.addField("location",idString);
 up.addField("access",access);
 up.addField("datasource",datasource);
 up.addField("title",title);
 up.addField("author",author);
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: write.lock file appears and solr wont open

2017-09-04 Thread Phil Scadden
We finally got a resolution to this - trivial, but related to trying to do 
things by remote control. The Solr process did not have permission to 
write to the core that was imported. When it tried to create the lock file it 
failed. The Solr code obviously assumes that a failure to create the file means the file 
already exists, rather than insufficient permissions. Checking for file 
existence would give a more informative message, but I am guessing a 
test/production setup where developers are not allowed access to the servers is 
reasonably unique (I hope so anyway, because it sucks).
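If anyone else hits this, a quick ownership check and fix that can go straight into the 
deployment document (assuming the service runs as the default solr user) is:

   sudo chown -R solr:solr /var/www/solr/data
   sudo -u solr touch /var/www/solr/data/prindex/data/index/perm_test \
     && sudo -u solr rm /var/www/solr/data/prindex/data/index/perm_test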

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 26 August 2017 9:15 a.m.
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: write.lock file appears and solr wont open

Odd. The way core discovery works, it starts at SOLR_HOME and recursively 
descends the directories. Whenever the recursion finds a "core.properties" file 
it says "Aha, this must be a core". From there it assumes the data directory is 
immediately below where it found the core.properties file in the absence of any 
dataDir overrides.

So how the write.lock file is getting preserved across Solr restarts is a 
mystery to me. Doing a "kill -9" is one way to make that happen if it is done 
at just the wrong time, but that's unlikely in what you're describing.

Are you totally sure that there were no old Solr processes running?
And there have been some issues in the past where the log display of the admin 
UI holds on to errors and displays them after the problem has been fixed. I'm 
assuming you can't query the new core, is that correct? Because if you can 
query the core then _something_ has the index open. I'm grasping at straws here 
mind you.

Best,
Erick

On Thu, Aug 24, 2017 at 9:02 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> SOLR_HOME is /var/www/solr/data
> The zip was actually the entire data directory which also included 
> configsets. And yes core.properties is in var/www/solr/data/prindex (just has 
> single line name=prindex, in it). No other cores are present.
> The data directory should have been unzipped before the solr instance was 
> started (I cant actually touch the machine so communicating via a deployment 
> document but the operator usually follows every step to the letter.
> The sequence was:
> mkdir /var/www/solr
> sudo bash ./install_solr_service.sh solr-6.5.1.tgz -i /opt/local -d
> /var/www/solr edit /etc/default/solr.in.sh to set various items. (esp
> SOLR_HOME and to set SOLR_PID_DIR to /var/www/solr) unzip the data
> directory service solr start.
>
> No other instance of solr installed.
>
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: query with wild card with AND taking lot of time

2017-09-03 Thread Phil Scadden
5 seems a reasonable limit to me. After that, revert to the slow way.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 2 September 2017 12:01 p.m.
To: solr-user 
Subject: Re: query with wild card with AND taking lot of time

How far would you take that? Say you had 100 terms joined by AND (ridiculous I 
know, just sayin' ). Then you'd chew up 100 entries in the filterCache.

On Fri, Sep 1, 2017 at 4:24 PM, Walter Underwood  wrote:
> Hmm. Solr really should convert an fq of “a AND b” to separate “a” and “b” fq 
> filters. That should be a simple special-case rewrite. It might take less 
> time to implement than explaining it to everyone.
>
> Well, I guess then we’d have to explain how it wasn’t really necessary
> to send separate fq params…
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 1, 2017, at 2:01 PM, Erick Erickson  wrote:
>>
>> Shawn:
>>
>> See: https://issues.apache.org/jira/browse/SOLR-7219
>>
>> Try fq=filter(foo) filter(bar) filter(baz)
>>
>> Patches to docs welcome ;)
>>
>> On Fri, Sep 1, 2017 at 1:50 PM, Shawn Heisey  wrote:
>>> On 9/1/2017 8:13 AM, Alexandre Rafalovitch wrote:
 You can OR cachable filter queries in the latest Solr. There is a
 special
 (filter) syntax for that.
>>>
>>> This is actually possible?  If so, I didn't see anything come across
>>> the dev list about it.
>>>
>>> I opened an issue for it, didn't know anything had been implemented.
>>> After I opened the issue, I discovered that I was merely the latest
>>> to do so, it had been requested before.
>>>
>>> Can you point to the relevant part of the reference guide and the
>>> Jira issue where the change was committed?
>>>
>>> Thanks,
>>> Shawn
>>>
>
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: write.lock file appears and solr wont open

2017-08-24 Thread Phil Scadden
SOLR_HOME is /var/www/solr/data
The zip was actually the entire data directory, which also included configsets. 
And yes, core.properties is in /var/www/solr/data/prindex (it just has the single 
line name=prindex in it). No other cores are present.
The data directory should have been unzipped before the Solr instance was 
started (I can't actually touch the machine, so I am communicating via a deployment 
document, but the operator usually follows every step to the letter).
The sequence was:
mkdir /var/www/solr
sudo bash ./install_solr_service.sh solr-6.5.1.tgz -i /opt/local -d 
/var/www/solr
edit /etc/default/solr.in.sh to set various items. (esp SOLR_HOME and to set 
SOLR_PID_DIR to /var/www/solr)
unzip the data directory
service solr start.

No other instance of solr installed.

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


write.lock file appears and solr wont open

2017-08-24 Thread Phil Scadden
I am slowly moving 6.5.1 from development to production. After installing Solr 
on the final test machine, I tried to supply a core by zipping up the data 
directory on development and unzipping it on test.
When I go to the admin UI I get an error that the core cannot be created 
[screenshot not preserved in the archive].
Write.lock is obviously causing a block, so I deleted the write.lock and restarted. 
I get the same error! What gives? Is it not possible to move a core like this?

Looking at the log I see:
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
Unable to create core [prindex]
 at java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.util.concurrent.FutureTask.get(FutureTask.java:192)
 at 
org.apache.solr.core.CoreContainer.lambda$load$6(CoreContainer.java:581)
 at 
org.apache.solr.core.CoreContainer$$Lambda$107/390689829.run(Unknown Source)
 at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
 at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
 at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$106/671471369.run(Unknown
 Source)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Unable to create core [prindex]
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:933)
 at 
org.apache.solr.core.CoreContainer.lambda$load$5(CoreContainer.java:553)
 at 
org.apache.solr.core.CoreContainer$$Lambda$105/1138410383.call(Unknown Source)
 at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
 ... 6 more
Caused by: org.apache.solr.common.SolrException: 
/var/www/solr/data/prindex/data/index/write.lock
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:965)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:831)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:918)
 ... 9 more
Caused by: java.nio.file.NoSuchFileException: 
/var/www/solr/data/prindex/data/index/write.lock
 at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
 at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
 at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
 at sun.nio.fs.UnixPath.toRealPath(UnixPath.java:837)
 at 
org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:104)
 at 
org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
 at 
org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
 at 
org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:104)
 at org.apache.lucene.index.IndexWriter.isLocked(IndexWriter.java:4773)
 at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:710)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:912)

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Optimizing Dataimport from Oracle; cursor sharing; changing oracle session parameters

2017-08-15 Thread Phil Scadden
Perhaps there is potential to optimize with some PL/SQL functions on the Oracle 
side to do as much work within the database as possible, and have the text 
indexers only access a view referencing those functions. Also, the obvious 
optimization is a record-updated timestamp, so that every time the indexer runs, 
only changed data is processed.
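A sketch of what that looks like in a DIH data-config (table and column names here are 
made up, not the real schema):

   <entity name="doc" pk="ID"
           query="SELECT id, title, body FROM docs"
           deltaQuery="SELECT id FROM docs WHERE last_updated &gt; TO_DATE('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')"
           deltaImportQuery="SELECT id, title, body FROM docs WHERE id='${dih.delta.ID}'"/>

With that in place a delta-import only touches rows changed since the last run instead of 
re-reading all 3.5 million.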

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, 16 August 2017 5:42 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Optimizing Dataimport from Oracle; cursor sharing; changing oracle 
session parameters

On 8/15/2017 8:09 AM, Mannott, Birgit wrote:
> I'm using solr 6.6.0 and I have to do a complex data import from an oracle db 
> concerning 3.500.000 data rows.
> For each row I have 15 additional entities. That means that more than 52 
> Million selects are send to the database.
> For every select that is done I optimized the oracle execution path by 
> creating indizes.
> The execution plans are ok.
> But the import still lasts 12 hours.

I think the reason it takes 12 hours is because there are 52 million SELECT 
statements.  That many statements over 12 hours is an average of
1200 per second.  This sounds like pretty good database performance.

> Is there a way to execute a command like "ALTER SESSION SET cursor_sharing = 
> FORCE;" after getting the connection for processing an entity?

I think that most JDBC drivers (by default) don't allow multiple SQL statements 
to be sent in a single request, so commands like "SELECT FOO; SELECT BAR" won't 
work.  The idea behind denying this kind of command is protection against SQL 
injection attacks.  There is likely a JDBC URL parameter for the Oracle driver 
that would allow that ... and if there is, then you could add that to the 
connection URL in the DIH config to allow putting the ALTER SESSION statement 
before SELECT in your DIH entity.

The Oracle driver might also have a JDBC URL parameter to turn on the cursor 
sharing you're interested in.  That would be the best way to handle it, if that 
is an option.  You're going to need to consult Oracle documentation or an 
Oracle support resource to find out what URL parameter options there are for 
their driver.

I have near zero experience with Oracle databases, but I suspect that even with 
cursor sharing, you're still going to have the sheer number of SELECT 
statements as a bottleneck.  If there is a performance improvement, it probably 
won't be dramatic.

Thanks,
Shawn

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Storing data in Solr

2017-08-08 Thread Phil Scadden
When I am putting PDF documents and rows from a table into the same index, I 
create a "dataSource" field to identify the source, and I don't store the database 
field values - only index them - apart from the unique key, which is stored as 
"document". On search, you process the output before passing it to the user. If 
the datasource is PDFs etc., then you should have highlighted text to pass on. If 
the dataSource is the table, then fetch the rows from the database and display the 
search fields as "highlights". A lot of postprocessing of search results, but it is 
easier to create meaningful results if a single row in the table contains what 
a user wants. You need a custom indexer and a custom results postprocessor, 
however.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Arabic words search in solr

2017-08-02 Thread Phil Scadden
Hopefully changing to default AND solves your problem. If so, I would be quite 
interested in what your index config looks like in the end. I also have an 
upcoming need to index Arabic words.

-Original Message-
From: mohanmca01 [mailto:mohanmc...@gmail.com]
Sent: Thursday, 3 August 2017 12:58 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search in solr

Hi Phil Scadden,

 Thank you for your reply,

We tried your suggested solution by removing the hyphen while indexing, but it 
was getting wrong results. I was searching for "شرطة ازكي" and it was showing me 
the result that I am looking for, plus irrelevant results which have either the 
first or the second word that I typed while searching.

First word: شرطة
Second Word: ازكي

results that we are getting:


{
  "responseHeader": {
"status": 0,
"QTime": 3,
"params": {
  "indent": "true",
  "q": "bizNameAr:(شرطة ازكي)",
  "_": "1501678260335",
  "wt": "json"
}
  },
  "response": {
"numFound": 444,
"start": 0,
"docs": [
  {
"id": "28107",
"bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - 
مركز شرطة إزكي",
"_version_": 1574621132849414100
  },
  {
"id": "13937",
"bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
"_version_": 157462113219720
  },
  {
"id": "15914",
"bizNameAr": "العلوي والازكي المتحدة ش.م.م",
"_version_": 1574621132344000500
  },
  {
"id": "20639",
"bizNameAr": "سحائب ازكي للتجارة",
"_version_": 1574621132574687200
  },
  {
"id": "25108",
"bizNameAr": "المستشفيات -  - مستشفى إزكي",
"_version_": 1574621132737216500
  },
  {
"id": "27629",
"bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
"_version_": 1574621132833685500
  },
  {
"id": "36351",
"bizNameAr": "طوارئ الكهرباء - إزكي",
"_version_": 157462113318391
  },
  {
"id": "61235",
"bizNameAr": "اضواء ازكي للتجارة",
"_version_": 1574621133785792500
  },
  {
"id": "66821",
"bizNameAr": "أطلال إزكي للتجارة",
"_version_": 1574621133915816000
  },
  {
"id": "67011",
"bizNameAr": "بنك ظفار - فرع ازكي",
"_version_": 1574621133920010200
  }
]
  }
}

Actually we expect only the result below, since it has both of the words that 
we typed while searching:

  {
"id": "28107",
"bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - 
مركز شرطة إزكي",
"_version_": 1574621132849414100
  },


Configuration:

In schema.xml we configured as below:

[the field type definition was stripped by the mailing list archive]

Thanks,





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
Sent from the Solr - User mailing list archive at Nabble.com.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Arabic words search in solr

2017-07-31 Thread Phil Scadden
Further to that. What results do you get when you put those indexed terms into 
the Analysis tool on the Solr UI?

-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Tuesday, 1 August 2017 9:06 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search in solr

Am I correct in assuming that you have the problem searching only when there is 
a hyphen in your indexed text? If so, then it would suggest that you need to 
use a different tokenizer when indexing - it looks like the hyphen is removed 
and the words on each side are concatenated - hence you need both terms to find the text.

-Original Message-
From: mohanmca01 [mailto:mohanmc...@gmail.com]
Sent: Tuesday, 1 August 2017 1:18 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Arabic words search in solr

Please help me on this...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348372.html
Sent from the Solr - User mailing list archive at Nabble.com.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Arabic words search in solr

2017-07-31 Thread Phil Scadden
Am I correct in assuming that you have the problem searching only when there is 
a hyphen in your indexed text? If so, then it would suggest that you need to 
use a different tokenizer when indexing - it looks like the hyphen is removed 
and the words on each side are concatenated - hence you need both terms to find the text.

-Original Message-
From: mohanmca01 [mailto:mohanmc...@gmail.com]
Sent: Tuesday, 1 August 2017 1:18 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Arabic words search in solr

Please help me on this...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348372.html
Sent from the Solr - User mailing list archive at Nabble.com.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Issues trying to boost phrase containing stop word

2017-07-20 Thread Phil Scadden
The simplest suggestion is to get rid of the stop word filter. I've seen people 
here comment that it is not worth it for the amount of space it saves.

-Original Message-
From: shamik [mailto:sham...@gmail.com]
Sent: Friday, 21 July 2017 9:49 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Issues trying to boost phrase containing stop word

Any suggestion?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-trying-to-boost-phrase-containing-stop-word-tp4346860p4347068.html
Sent from the Solr - User mailing list archive at Nabble.com.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Phil Scadden
HTTP - however, the big advantage of doing your indexing on a different machine 
is that the heavy lifting that Tika does in extracting text from documents, 
finding metadata etc. is not happening on the server. If the indexer crashes, it 
doesn't affect Solr either.

-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, 20 June 2017 11:29 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr) because 
Python is my main programming language. I have an impression that 1. they send 
HTTP requests to the server according to the server APIs; 2.
they are not official and thus possibly not up to date. Does SolrJ talk to the 
server via HTTP or some other more native ways? Is the main benefit of SolrJ 
over other clients the official shipment with Solr? Thank you.

Best regards,
Ziyuan

On Jun 19, 2017 18:43, "ZiYuan"  wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just
> trying to figure out what is going on by indexing one or two PDF files
> first. Thank you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson
> 
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of
>> /update/extract/
>> > from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to
>> _text_.
>> > Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher
>> > 
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too.
>> It’s
>> >> got schema and config and even UI for file indexing and searching.
>>  Check
>> >> it out README.txt under example/files in your Solr install.
>> >>
>> >> Erik
>> >>
>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
>> >> >
>> >> > Hi Erick,
>> >> >
>> >> > thanks very much for the explanations! Clarification for question 2:
>> more
>> >> > specifically I cannot see the field content in the returned
>> >> > JSON,
>> with
>> >> the
>> >> > the same definitions as in the post
>> >> > > >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/
>> >> >
>> >> > :
>> >> >
>> >> > > stored="true"/>
>> >> > > indexed="true"
>> >> > stored="false"/>
>> >> > 
>> >> >
>> >> > Is it so that Tika does not fill these two fields automatically
>> >> > and I
>> >> have
>> >> > to write some client code to fill them?
>> >> >
>> >> > Best regards,
>> >> > Ziyuan
>> >> >
>> >> >
>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <
>> erickerick...@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> 1> Yes, you can use your single definition. The author
>> >> >> 1> identifies
>> the
>> >> >> "text" field as a catch-all. Somewhere in the schema there'll
>> >> >> be a copyField directive copying (perhaps) many different
>> >> >> fields to the "text" field. That permits simple searches
>> >> >> against a single field rather than, say, using edismax to
>> >> >> search across multiple separate fields.
>> >> >>
>> >> >> 2> The link you referenced is for Data Import Handler, which is
>> >> >> 2> much
>> >> >> different than just posting files to Solr. See
>> >> >> ExtractingRequestHandler:
>> >> >> 

RE: CSV output

2017-06-15 Thread Phil Scadden
Embarrassing. Yes, it was the proxy. Very old code that has now had a 
considerable refresh.

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
Sent: Thursday, 15 June 2017 7:13 p.m.
To: solr-user@lucene.apache.org
Subject: Re: CSV output

Is it the proxy affecting the output? What do you get going directly to 
Solr's endpoint?

   Erik

> On Jun 14, 2017, at 22:13, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> If I try
> /getsolr? 
> fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv
>
> The response I get is:
> id,title,datasource,scoreW:\PR_Reports\OCR\PR869.pdf,,Petroleum 
> Reports,8.233313W:\PR_Reports\OCR\PR3440.pdf,,Petroleum 
> Reports,8.217836W:\PR_Reports\OCR\PR4313.pdf,,Petroleum 
> Reports,8.206703W:\PR_Reports\OCR\PR3906.pdf,,Petroleum 
> Reports,8.185147W:\PR_Reports\OCR\PR1592.pdf,,Petroleum 
> Reports,8.167614W:\PR_Reports\OCR\PR998.pdf,,Petroleum 
> Reports,8.161142W:\PR_Reports\OCR\PR2457.pdf,,Petroleum 
> Reports,8.155497W:\PR_Reports\OCR\PR2433.pdf,,Petroleum 
> Reports,8.152924W:\PR_Reports\OCR\PR1184.pdf,,Petroleum 
> Reports,8.124402W:\PR_Reports\OCR\PR3551.pdf,,Petroleum Reports,8.124402
>
> ie no newline separators at all (Solr 6.5.1) (/getsolr is api that proxy to 
> the solr server).
> Changing it to
> /getsolr?csv.newline=%0A=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv
>
> Makes no difference. What I am doing wrong here? Is there another way to 
> specify csv parameters? It says default is \n but I am not seeing that.
>
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Issue with highlighter

2017-06-14 Thread Phil Scadden
Just had a similar issue - works for some, not others. The first thing to look at is 
hl.maxAnalyzedChars in the query. The default is quite small.
Since many of my documents are large PDF files, I opted to use 
storeOffsetsWithPositions="true" termVectors="true" on the field I was 
searching on.
This certainly did increase my index size but not too bad and certainly fast.
https://cwiki.apache.org/confluence/display/solr/Highlighting
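For reference, that just means declaring the searched field along these lines (the field 
and type names here are placeholders for whatever you already use):

   <field name="_text_" type="text_general" indexed="true" stored="true"
          termVectors="true" storeOffsetsWithPositions="true"/>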

Beware of NOT plus OR in a search. That will certainly produce no highlights. 
(eg test -results when default op is OR)


-Original Message-
From: Ali Husain [mailto:alihus...@outlook.com]
Sent: Thursday, 15 June 2017 11:11 a.m.
To: solr-user@lucene.apache.org
Subject: Issue with highlighter

Hi,


I think I've found a bug with the highlighter. I search for the word 
"something" and I get an empty highlighting response for all the documents that 
are returned shown below. The fields that I am searching over are text_en, the 
highlighter works for a lot of queries. I have no stopwords.txt list that could 
be messing this up either.


 "highlighting":{
"310":{},
"103":{},
"406":{},
"1189":{},
"54":{},
"292":{},
"309":{}}}


Just changing the search term to "something like" I get back this:


"highlighting":{
"310":{},
"309":{
  "content":["1949 Convention, like those"]},
"103":{},
"406":{},
"1189":{},
"54":{},
"292":{},
"286":{
  "content":["persons in these classes are treated like 
combatants, but in other respects"]},
"336":{
  "content":["   be treated like engagement"]}}}


So I know that I have it setup correctly, but I can't figure this out. I've 
searched through JIRA/Google and haven't been able to find a similar issue.


Any ideas?


Thanks,

Ali
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


CSV output

2017-06-14 Thread Phil Scadden
If I try
/getsolr? 
fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv

The response I get is:
id,title,datasource,scoreW:\PR_Reports\OCR\PR869.pdf,,Petroleum 
Reports,8.233313W:\PR_Reports\OCR\PR3440.pdf,,Petroleum 
Reports,8.217836W:\PR_Reports\OCR\PR4313.pdf,,Petroleum 
Reports,8.206703W:\PR_Reports\OCR\PR3906.pdf,,Petroleum 
Reports,8.185147W:\PR_Reports\OCR\PR1592.pdf,,Petroleum 
Reports,8.167614W:\PR_Reports\OCR\PR998.pdf,,Petroleum 
Reports,8.161142W:\PR_Reports\OCR\PR2457.pdf,,Petroleum 
Reports,8.155497W:\PR_Reports\OCR\PR2433.pdf,,Petroleum 
Reports,8.152924W:\PR_Reports\OCR\PR1184.pdf,,Petroleum 
Reports,8.124402W:\PR_Reports\OCR\PR3551.pdf,,Petroleum Reports,8.124402

ie no newline separators at all (Solr 6.5.1) (/getsolr is api that proxy to the 
solr server).
Changing it to
/getsolr?csv.newline=%0A=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv

Makes no difference. What I am doing wrong here? Is there another way to 
specify csv parameters? It says default is \n but I am not seeing that.

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


RE: Highlighter not working on some documents

2017-06-12 Thread Phil Scadden
I managed to miss that. Thanks very much. I have some very large documents. I 
will look at index size and look at posting instead.

-Original Message-
From: David Smiley [mailto:david.w.smi...@gmail.com]
Sent: Monday, 12 June 2017 2:40 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Highlighter not working on some documents

Probably the most common reason is the default hl.maxAnalyzedChars -- thus your 
highlightable text might not be in the first 51200 chars of text.  The first 
Solr release with the unified highlighter had an even lower default of 10k 
chars.

On Fri, Jun 9, 2017 at 9:58 PM Phil Scadden <p.scad...@gns.cri.nz> wrote:

> Tried hard to find difference between pdfs returning no highlighter
> and ones that do for same search term.  Includes pdfs that have been
> OCRed and ones that were text to begin with. Head scratching to me.
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, 10 June 2017 6:22 a.m.
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Highlighter not working on some documents
>
> Need lots more information. I.e. schema definitions, query you use,
> handler configuration and the like. Note that highlighted fields must
> have stored="true" set and likely the _text_ field doesn't. At least
> in the default schemas stored is set to false for the catch-all field.
> And you don't want to store that information anyway since it's usually
> the destination of copyField directives and you'd highlight _those_ fields.
>
> Best,
> Erick
>
> On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> > Do a search with:
> > fl=id,title,datasource=true=unified=50=1=p
> > re
> > ssure+AND+testing=50=0=json
> >
> > and I get back a good list of documents. However, some documents are
> returning empty fields in the highlighter. Eg, in the highlight array have:
> > "W:\\Reports\\OCR\\4272.pdf":{"_text_":[]}
> >
> > Getting this well up the list of results with good highlighted
> > matchers
> above and below this entry. Why would the highlighter be failing?
> >
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


RE: including a minus sign "-" in the token

2017-06-11 Thread Phil Scadden
Looking at the Classic tokenizer, I notice that it does not split on a hyphen if 
there is a number in the word. Pretty much exactly what I want. What are the 
downsides to using Classic?
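
A quick way to check that behaviour outside Solr is Lucene's ClassicAnalyzer 
(org.apache.lucene.analysis.standard). This is only a sketch, and the expected 
tokens in the comment are my reading of the ClassicTokenizer docs, worth verifying:

 static void show(String text) throws IOException {
 try (Analyzer a = new ClassicAnalyzer();
  TokenStream ts = a.tokenStream("f", text)) {
 CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
 ts.reset();
 while (ts.incrementToken()) {
 // for "Wainui-8 inter-montane" expect: wainui-8, inter, montane
 System.out.println(term);
 }
 ts.end();
 }
 }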

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Monday, 12 June 2017 2:44 a.m.
To: Phil Scadden <p.scad...@gns.cri.nz>
Subject: Re: including a minus sign "-" in the token

On 6/9/2017 8:12 PM, Phil Scadden wrote:
> So, the field I am using for search has type of:
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> You are saying "wainui-8" will indexed as one token? But I should add a 
> worddelimiterfilter to the analyser to prevent it being split? Or I guess the 
> Worddelimitergraphfilter.

No, I was saying that the query parser won't look at the hyphen in
wainui-8 and treat it as a "NOT" operator.

Whatever you've got for index/query analysis will still take effect after that 
-- and it will do that even if you escape characters with a backslash.

Your index and query analysis are almost the same, but query analysis does 
synonym replacement.  The StandardTokenizerFactory will split "wainui-8" into 
two tokens and remove the hyphen, even if you escape it at query time.

> Ideally I want "inter-montane" say, to be treated as hyphenated, but hyphen 
> followed by a number to NOT be treated as a hyphenated. That would mean 
> catenateWords:1 but catenateNumbers:0???
> What would it do with Wainui-10A?

I'm not sure that there is any single built-in analysis component that will do 
what you want.  Your index analysis includes StandardTokenizerFactory, so it is 
going to remove hyphens and split tokens at those locations, whether it is 
followed by numbers or not.
You're going to need to switch to the whitespace tokenizer and add a filter 
(like the word delimiter filter) to do further splitting.  The 
"splitOnNumerics" setting for the word delimiter filter *might* do it, but I'm 
not sure.  It might take a combination of filters.

Thanks,
Shawn



RE: including a minus sign "-" in the token

2017-06-09 Thread Phil Scadden
So, the field I am using for search has type of:
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
You are saying "wainui-8" will be indexed as one token? But I should add a 
WordDelimiterFilter to the analyser to prevent it being split? Or I guess the 
WordDelimiterGraphFilter.

Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen 
followed by a number to NOT be treated as hyphenated. That would mean 
catenateWords:1 but catenateNumbers:0???
What would it do with Wainui-10A?

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token

On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which have
> convention naming of geographicname-number. Eg Wainui-8 I want the tokenizer 
> to treat it as Wainui-8 when indexing, and when I search I want to a q of 
> Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with Wainui-8 
> but not with Wainui-9 or plain Wainui.
>
> Docs are pdfs, and I have using tika to extract text.
>
> How do I set up solr for queries like this?

At indexing time, Solr does not treat the hyphen as a special character like it 
does at query time.  Many analysis components do, though.  If your analysis 
chain includes certain components (the standard tokenizer, the ICU tokenizer, 
and WordDelimiterFilter are on that list), then the hyphen may be treated as a 
word-break character and the analysis could remove it.

At query time, a hyphen in the middle of a word is not treated as a special 
character.  It would need to be at the beginning of the query text or after a 
space for the query parser to treat it as a negation.
So Wainui-8 would not be a problem, but -7 would, and you'd need to specify it 
as \-7 for it to work like you want.

Thanks,
Shawn
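
As an aside, if the query text comes from users, SolrJ can do that escaping for you. 
A small sketch; ClientUtils is in org.apache.solr.client.solrj.util and the field 
name is just an example:

 // escapeQueryChars backslash-escapes characters such as -, +, : and whitespace,
 // so a leading hyphen is not parsed as a NOT operator
 String escaped = ClientUtils.escapeQueryChars("-7");   // gives \-7
 SolrQuery q = new SolrQuery("_text_:" + escaped);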



RE: Highlighter not working on some documents

2017-06-09 Thread Phil Scadden
Tried hard to find a difference between pdfs returning no highlights and ones 
that do for the same search term.  Includes pdfs that have been OCRed and ones that 
were text to begin with. Head-scratching to me.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 10 June 2017 6:22 a.m.
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Highlighter not working on some documents

Need lots more information. I.e. schema definitions, query you use, handler 
configuration and the like. Note that highlighted fields must have 
stored="true" set and likely the _text_ field doesn't. At least in the default 
schemas stored is set to false for the catch-all field.
And you don't want to store that information anyway since it's usually the 
destination of copyField directives and you'd highlight _those_ fields.

Best,
Erick

On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> Do a search with:
> fl=id,title,datasource=true=unified=50=1=pre
> ssure+AND+testing=50=0=json
>
> and I get back a good list of documents. However, some documents are 
> returning empty fields in the highlighter. Eg, in the highlight array have:
> "W:\\Reports\\OCR\\4272.pdf":{"_text_":[]}
>
> Getting this well up the list of results with good highlighted matchers above 
> and below this entry. Why would the highlighter be failing?
>


RE: Highlighter not working on some documents

2017-06-09 Thread Phil Scadden
Managed-schema attached (not a default) and the solrconfig.xml. _text_ is 
stored (not sure how else highlighting could work??).  The indexer puts the 
body text of the pdf into the _text_ field. What would the value be in putting it 
into a different field and then using copyField??
 Ie
 SolrInputDocument up = new SolrInputDocument();
 String content = textHandler.toString();
 up.addField("_text_",content);

 solr.add(up);

The puzzling thing for me is why some documents produce highlights and 
others do not. The highlights in the documents that work are pulling body-text 
fragments, not things stored in some other field.
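
One way to narrow it down might be to pull back the stored _text_ for a failing 
document and see where the match sits. A sketch only; "solr" is an existing 
SolrClient, "mycore" is a placeholder core name, and exception handling is omitted:

 SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars("W:\\Reports\\OCR\\4272.pdf"));
 q.setFields("_text_");
 SolrDocument doc = solr.query("mycore", q).getResults().get(0);
 String text = (String) doc.getFieldValue("_text_");
 System.out.println("first match at offset " + text.toLowerCase().indexOf("pressure"));

If that offset is well past hl.maxAnalyzedChars (51200 by default, per the reply 
further up), an empty highlight entry for that document would be expected.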

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 10 June 2017 6:22 a.m.
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Highlighter not working on some documents

Need lots more information. I.e. schema definitions, query you use, handler 
configuration and the like. Note that highlighted fields must have 
stored="true" set and likely the _text_ field doesn't. At least in the default 
schemas stored is set to false for the catch-all field.
And you don't want to store that information anyway since it's usually the 
destination of copyField directives and you'd highlight _those_ fields.

Best,
Erick

On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> Do a search with:
> fl=id,title,datasource=true=unified=50=1=pre
> ssure+AND+testing=50=0=json
>
> and I get back a good list of documents. However, some documents are 
> returning empty fields in the highlighter. Eg, in the highlight array have:
> "W:\\Reports\\OCR\\4272.pdf":{"_text_":[]}
>
> Getting this well up the list of results with good highlighted matchers above 
> and below this entry. Why would the highlighter be failing?
>


solrconfig.xml
Description: solrconfig.xml


Highlighter not working on some documents

2017-06-08 Thread Phil Scadden
Do a search with:
fl=id,title,datasource=true=unified=50=1=pressure+AND+testing=50=0=json

and I get back a good list of documents. However, some documents are returning 
empty fields in the highlighter. Eg, in the highlight array have:
"W:\\Reports\\OCR\\4272.pdf":{"_text_":[]}

Getting this well up the list of results, with good highlighted matches above 
and below this entry. Why would the highlighter be failing?



including a minus sign "-" in the token

2017-06-08 Thread Phil Scadden
We have important entities referenced in indexed documents which follow a 
naming convention of geographicname-number, e.g. Wainui-8.
I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I 
want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to return docs 
with Wainui-8 but not with Wainui-9 or plain Wainui.

Docs are pdfs, and I am using tika to extract the text.

How do I set up solr for queries like this?



RE: Got a 404 trying to update a solr. 6.5.1 server. /solr/update not found.

2017-06-06 Thread Phil Scadden
Duh! Thanks for that.

-Original Message-
From: tflo...@apple.com [mailto:tflo...@apple.com]
Sent: Tuesday, 6 June 2017 4:25 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Got a 404 trying to update a solr. 6.5.1 server. /solr/update not 
found.

I think you are missing the collection name in the path.

Tomás

Sent from my iPhone
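
I.e. something like the following (a sketch; "mycore" stands in for whatever the 
core or collection is actually called):

 ConcurrentUpdateSolrClient solr =
 new ConcurrentUpdateSolrClient("http://myhost:8983/solr/mycore", 10, 2);
 // updates now go to /solr/mycore/update instead of /solr/update
 solr.deleteByQuery("*:*");   // try/catch as in the original code omitted
 solr.commit();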

> On Jun 5, 2017, at 9:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> Simple piece of code. Had been working earlier (though against a 6.4.2 
> instance).
>
>  ConcurrentUpdateSolrClient solr = new 
> ConcurrentUpdateSolrClient("http://myhost:8983/solr",10,2);
>   try {
>solr.deleteByQuery("*:*");
>solr.commit();
>   } catch (SolrServerException | IOException ex) {
>// logger handler stuff omitted.
>   }
>
> Comes back with:
> 15:53:36,693 DEBUG wire:72 -  << "[\n]"
> 15:53:36,694 DEBUG wire:72 -  << " content="text/html;charset=utf-8"/>[\n]"
> 15:53:36,694 DEBUG wire:72 -  << "Error 404 Not Found[\n]"
> 15:53:36,695 DEBUG wire:72 -  << "[\n]"
> 15:53:36,695 DEBUG wire:72 -  << "HTTP ERROR 404[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "Problem accessing /solr/update. 
> Reason:[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "Not Found[\n]"
> 15:53:36,696 DEBUG wire:72 -  << "[\n]"
> 15:53:36,697 DEBUG wire:72 -  << "[\n]"
>
> If I access http://myhost:8983/solr/update then I get that html too, but 
> http://myhost:8983/solr comes up with admin page as normal so Solr appears to 
> be running okay.


Got a 404 trying to update a solr. 6.5.1 server. /solr/update not found.

2017-06-05 Thread Phil Scadden
Simple piece of code. Had been working earlier (though against a 6.4.2 
instance).

  ConcurrentUpdateSolrClient solr = new 
ConcurrentUpdateSolrClient("http://myhost:8983/solr",10,2);
   try {
solr.deleteByQuery("*:*");
solr.commit();
   } catch (SolrServerException | IOException ex) {
// logger handler stuff omitted.
   }

Comes back with:
15:53:36,693 DEBUG wire:72 -  << "[\n]"
15:53:36,694 DEBUG wire:72 -  << "[\n]"
15:53:36,694 DEBUG wire:72 -  << "Error 404 Not Found[\n]"
15:53:36,695 DEBUG wire:72 -  << "[\n]"
15:53:36,695 DEBUG wire:72 -  << "HTTP ERROR 404[\n]"
15:53:36,696 DEBUG wire:72 -  << "Problem accessing /solr/update. 
Reason:[\n]"
15:53:36,696 DEBUG wire:72 -  << "Not Found[\n]"
15:53:36,696 DEBUG wire:72 -  << "[\n]"
15:53:36,697 DEBUG wire:72 -  << "[\n]"

If I access http://myhost:8983/solr/update then I get that html too, but 
http://myhost:8983/solr comes up with the admin page as normal, so Solr appears to 
be running okay.


RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Phil Scadden
Yes, that would seem an accurate assessment of the problem.

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Thursday, 30 March 2017 4:53 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR

Thanks for your reply.

From what I see, getting more hardware to do the OCR is inevitable?

Even if we run the OCR outside of Solr indexing stream, it will still take a 
long time to process it if it is on just one machine. And we still need to wait 
for the OCR to finish converting before we can run the indexing to Solr.

Regards,
Edwin


RE: Indexing speed reduced significantly with OCR

2017-03-28 Thread Phil Scadden
Well, I haven’t had to deal with a problem that size, but it seems to me that 
you have little alternative except to throw more computer hardware at it. For 
the job I did, I OCRed to convert PDFs to searchable PDFs outside the indexing 
workflow. I used the pdftotext utility to extract text from the pdf. If the extracted 
text was <1% of the document size, then I assumed it needed to be OCRed, otherwise I 
didn’t bother. You could look at a more sophisticated method to determine whether OCR 
was necessary. Doing it outside the indexing stream means you can use different 
hardware for OCR. Converting to searchable PDF means you do it only once - a 
reindex doesn’t need to do OCR.
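
A rough sketch of that "<1% of the file size" check, assuming the pdftotext binary is 
on the PATH (the helper name and threshold here are mine, not a tested recipe):

 static boolean needsOcr(File pdf) throws IOException, InterruptedException {
 File txt = File.createTempFile("extract", ".txt");
 try {
 // pdftotext writes the extracted text to the second argument
 Process p = new ProcessBuilder("pdftotext", pdf.getAbsolutePath(), txt.getAbsolutePath())
 .redirectErrorStream(true).start();
 p.waitFor();
 // if almost no text came out, assume the PDF is scanned images and needs OCR
 return txt.length() < pdf.length() / 100;
 } finally {
 txt.delete();
 }
 }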


RE: Indexing speed reduced significantly with OCR

2017-03-27 Thread Phil Scadden
Only by 10? You must have quite small documents. OCR is an extremely expensive 
process. Indexing is trivial by comparison. For the quite large documents I am 
working with, OCR can be 100 times slower than indexing a PDF that is searchable 
(text extractable without OCR).

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Tuesday, 28 March 2017 4:13 p.m.
To: solr-user@lucene.apache.org
Subject: Indexing speed reduced significantly with OCR

Hi,

Does the indexing speed of Solr reduce significantly when we are using 
Tesseract OCR to extract scanned inline images from PDF?

I found that after I implemented the solution to extract those scanned images 
from PDF, the indexing speed is now slower by more than 10 times.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin


RE: Index scanned documents

2017-03-26 Thread Phil Scadden
While building OCR directly into Solr might be appealing, I would argue that it is 
best to use OCR software first, outside of SOLR, to convert the PDF into 
"searchable" PDF format. That way, when the document is retrieved, it is a lot 
more useful to the searcher - making it easy to find the text within the PDF.


Finding time of last commit to index from SolrJ?

2017-03-15 Thread Phil Scadden
The admin gui displays the time of last commit to a core but how can this be 
queried from within SolrJ?
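
One possible approach (a sketch only; I have not verified the key name across 
versions, and "mycore" is a placeholder core name) is to ask the Luke request 
handler, which reports index information including a lastModified timestamp:

 LukeRequest luke = new LukeRequest();   // org.apache.solr.client.solrj.request.LukeRequest
 luke.setNumTerms(0);                    // skip the per-field term stats
 LukeResponse rsp = luke.process(solr, "mycore");
 Object lastModified = rsp.getIndexInfo().get("lastModified");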



getting "dedupe" management when updating index via tika and SolrJ

2017-03-15 Thread Phil Scadden
I have added a signature field to the schema and set up the dedupe processor chain in 
solrconfig.xml as per the docs; however, the docs say:

“Be sure to change your update handlers to use the defined chain, as below:”

Umm, WHERE do you change the update handler to use the defined chain? Is this 
in one of the config XMLs, or is this statement about a dataimport config?

I am updating the index via solrj
 SolrInputDocument up = new SolrInputDocument();
 up.addField("id",f.getCanonicalPath());
 up.addField("access",access);
 up.addField("title",metadata.get("title"));
 up.addField("author",metadata.get("author"));
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);

Should dedupe be part of this code??
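
One way that might work from SolrJ (a sketch, assuming the chain in solrconfig.xml is 
named "dedupe" and "mycore" stands in for the actual core name) is to pass update.chain 
on the update request rather than relying on the handler default:

 UpdateRequest req = new UpdateRequest();
 req.add(up);
 req.setParam("update.chain", "dedupe");   // route this update through the dedupe processor chain
 req.process(solr, "mycore");
 solr.commit("mycore");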


RE: https

2017-03-08 Thread Phil Scadden
What we are suggesting is that your browser does NOT access solr directly at 
all. In fact, configure the firewall so that SOLR is unreachable from outside the 
server. Instead, you write a proxy in your site application which calls SOLR. 
I.e. a server-to-server call instead of browser-to-server. This is a 
much more secure setup and allows you to "vet" query requests, potentially 
distribute to different cores based on some application logic, etc. It shouldn’t be 
hard to find skeleton proxy code in whatever language your site application is written in.
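
A bare-bones sketch of such a proxy as a Java servlet (the class name, path, core name 
and the crude whitelisting are illustrative only; adapt to whatever framework the site 
already uses):

 import java.io.IOException;
 import java.io.InputStream;
 import java.io.OutputStream;
 import java.net.URL;
 import java.net.URLEncoder;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;

 public class SolrProxyServlet extends HttpServlet {
 @Override
 protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
 String q = req.getParameter("q");
 if (q == null || q.contains("{!")) {   // crude sanitising: reject empty/local-params queries
 resp.sendError(HttpServletResponse.SC_BAD_REQUEST);
 return;
 }
 // server-to-server call; Solr itself stays firewalled off from browsers
 String url = "http://localhost:8983/solr/mycore/select?wt=json&rows=20&q="
 + URLEncoder.encode(q, "UTF-8");
 resp.setContentType("application/json");
 try (InputStream in = new URL(url).openStream(); OutputStream out = resp.getOutputStream()) {
 byte[] buf = new byte[8192];
 int n;
 while ((n = in.read(buf)) != -1) {
 out.write(buf, 0, n);
 }
 }
 }
 }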

-Original Message-
From: pubdiverses [mailto:pubdiver...@free.fr]
Sent: Thursday, 9 March 2017 8:12 a.m.
To: solr-user@lucene.apache.org
Subject: Re: https

Hello,

I give you some more explanation.

I have a site https://site.com under Apache.
On the same physical server, I've installed solr.

Inside https://site.com, I've a search form which calls solr with 
http://xxx.xxx.xxx.xxx/solr.

But the browser says "mixed content" and blocks the call.

So, i need to have something like https://xxx.xxx.xxx.xxx/solr

Is it possible ?



On 07/03/2017 at 22:19, Alexandre Rafalovitch wrote:
> The first advise is NOT to expose your Solr directly to the public.
> Anyone that can hit /search, can also hit /update and wipe out your
> index.
>
> Unless you run a proper proxy that secures URLs and sanitizes the
> parameters (in GET, in POST, escaped, etc).  And if you are doing
> that, you can setup the HTTPS in your proxy and have it speak HTTP to
> Solr on the backend.
>
> Otherwise, you need middleware, which runs on a server as well, so you
> are back into configuring _that_ server (not Solr) for HTTPS.
>
> Regards,
> Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>
>
> On 7 March 2017 at 15:45, pubdiverses  wrote:
>> Hello,
>>
>> I would like to acces my solr instance with https://domain.com/solr.
>>
>> how to do this ?



RE: https

2017-03-07 Thread Phil Scadden

>The first advise is NOT to expose your Solr directly to the public.
>Anyone that can hit /search, can also hit /update and wipe out your index.

I would second that too. We have never exposed Solr and I also sanitise queries 
in the proxy.


RE: Managed schema vs schema.xml

2017-03-07 Thread Phil Scadden
I would second that the guide could be clearer on that. I read and reread it several 
times trying to get my head around the schema.xml/managed-schema bit. I came 
away from a first cursory reading with the idea that managed-schema was mostly 
for schema-less mode, and only after some stuff-ups and puzzling over comments 
in the basic-config schema file itself did I go back for a more careful re-read. 
I am still not sure that I have got all the nuances. My understanding is:

If you don’t want the ability to edit it via the admin UI or Config API, rename it to 
schema.xml. Unclear whether you have to make changes to other configs to do 
this. Also unclear to me whether there is any upside at all to using 
schema.xml? Why degrade functionality? Does the capacity for schema.xml only 
exist for backward compatibility?

If you want to run schema-less, you have to use managed-schema? (I didn’t 
delve too deep into this).

In the end, I used basic-config to create core and then hacked managed-schema 
from there.


I would have to say the "basic-config" seems distinctly more than basic. It is 
still a huge file. I thought perhaps I could delete every unused field type, 
but worried there were some "system" dependencies. I.e. if you want *target-type 
wildcard queries, do you need to have text_general_reverse and a copy to it? If 
you always explicitly set only defined fields in a custom indexer, then can you 
dump the whole dynamic-fields bit?


Recommendation for production SOLR

2017-03-06 Thread Phil Scadden
Given the known issues with 6.4.1 and no release date for 6.4.2, is the best 
recommendation for a production version of SOLR 6.3.0? Hoping to go to 
production in the first week of April.


RE: Excessive Wire logging while indexing.

2017-03-02 Thread Phil Scadden
Got it all working with Tika and SolrJ. (Got the correct artifacts). Much 
faster now too which is good. Thanks very much for your help.


RE: Excessive Wire logging while indexing. Blank output from tika parser

2017-03-01 Thread Phil Scadden
Belay that. I found out why the parser was just returning empty data - I didn’t 
have the right artefacts in Maven. In case anyone else trips on this:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.12</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.12</version>
</dependency>



RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden

>Another side issue:  Using the extracting handler for handling rich documents 
>is discouraged.  Tika (which is what is used by the extracting
>handler) is pretty amazing software, but it has a habit of crashing or 
>consuming all the heap memory when it encounters a document that it doesn't 
>>know how to properly handle.  It is best to run Tika in your external program 
>and send its output to Solr, so that if there's a problem, it won't affect 
>>your search capability.


As an alternative to earlier code, I had tried this (exactly the same set of 
files going in)

File f = new File(filename);
 ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ParseContext context = new ParseContext();
 InputStream input = new FileInputStream(f);
 try {
   parser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
   
Logger.getLogger(JsMapAdminService.class.getName()).log(Level.SEVERE, 
null,String.format("File %s failed", f.getCanonicalPath()));
   e.printStackTrace();
  }
 SolrInputDocument up = new SolrInputDocument();
 up.addField("id",f.getCanonicalPath());
 up.addField("fileLocation",idString);
 up.addField("access",access);
 up.addField("title",metadata.get("title"));
 up.addField("author",metadata.get("author"));
 String content = textHandler.toString();
 up.addField("_text_",content);
 solr.add(up);
 return true;
Exceptions were never triggered, but metadata was essentially empty except for 
contentType, and content was always an empty string. I don’t know what the parser 
was doing, but I gave up and went with the extracting handler route instead, which did 
at least build a full index.


RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
The logging is coming from the application, which is running in Tomcat. Solr itself 
is running in the embedded Jetty.

And yes, another look at the log4j config and I see that the rootLogger is set to DEBUG. 
I've changed that.

>On the Solr server side, the 6.4.x versions have a bug that causes extremely 
>high CPU usage and very slow operation.  This will be fixed in 6.4.2, which 
>>will hopefully be out soon.  There is currently no ETA for this release.

Thanks for that. I will watch for the release.

>Another side issue:  Using the extracting handler for handling rich documents 
>is discouraged.  Tika (which is what is used by the extracting
>handler) is pretty amazing software, but it has a habit of crashing or 
>consuming all the heap memory when it encounters a document that it doesn't 
>>know how to properly handle.  It is best to run Tika in your external program 
>and send its output to Solr, so that if there's a problem, it won't affect 
>>your search capability.

That was frankly what I thought I was doing.

The calling code is:
public Long SolrIndex(String path, String target, String URL, 
List roles, CurrentState curState) {
   HttpSolrClient solr = new HttpSolrClient(URL);
   try {
solr.deleteByQuery("*:*");
solr.commit();
   } catch (SolrServerException | IOException ex) {
Logger.getLogger(JsMapService.class.getName()).log(Level.SEVERE, 
null, ex);
   }

... lot of stuff to find data to index and set fields
...
  if (solrIndexFile(solr,filename,access,mi.getMcontent())) count++;

try {
solr.commit();
} catch (SolrServerException | IOException ex) {
Logger.getLogger(JsMapService.class.getName()).log(Level.SEVERE, 
null, ex);
}


Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
Using Solr 6.4.1 on Windows. Installed it, and a trial POST on my directories worked 
okay. However, I am now trying to create an index, with my own schema, from code 
running on Tomcat on the same machine as the SOLR server. Indexing of PDFs is very slow. 
Investigating that, I find my tomcat output full of wire logging:
12:07:06,758 DEBUG wire:72 -  >> "[0xe3][0xe0][0xc7]L[0xf5][0xea][\r]?
12:07:06,763 DEBUG wire:72 -  >> 
"f[0x81][0xb0]b[0xca][0xfa][0xb7][0x1f]n[0xff][0x0][0xa8][0xd0][0x16][0xbb]*[0xfe][0x95][0x98]-[0xbd][0xb7]-
12:07:06,805 DEBUG wire:72 -  >> 
"X[0x3][0xd4][0xcf]OOS:[0x94][0x8b][0xe2][0x89][0x8f][0xc9]rClVQ[0x85][0x1e][0x82]T[0x9e][0xe7]N[0xf4][0xfa]-
This can't be helping. My code is:

  try {
ContentStreamUpdateRequest up = new 
ContentStreamUpdateRequest("/update/extract");
File f = new File(filename);
ContentStreamBase.FileStream cs = new 
ContentStreamBase.FileStream(f);
up.addContentStream(cs);
up.setParam("literal.id",f.getPath());
up.setParam("literal.location", idString);
up.setParam("literal.access",access.toString());
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);

All the logging is generated by the last line. I don’t have any httpclient.wire lines 
in my log4j.properties (I presume these are from httpclient wire logging). What do I do 
to turn this off?
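
In case it helps anyone else: assuming log4j 1.x, the wire noise can usually be silenced 
either with a line in log4j.properties or programmatically. The category name for 
HttpClient 4.x wire output is normally org.apache.http.wire, but that is a guess worth 
verifying against your classpath:

 // e.g. at application startup (log4j 1.x API)
 org.apache.log4j.Logger.getLogger("org.apache.http.wire").setLevel(org.apache.log4j.Level.WARN);
 org.apache.log4j.Logger.getLogger("org.apache.http.headers").setLevel(org.apache.log4j.Level.WARN);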


Phil Scadden, Geoscientist
GNS Science I Te Pῡ Ao

764 Cumberland Street, Dunedin 9016, Private Bag 1930, Dunedin 9054, New Zealand
Ph +64 3 4799663  Mob +64 027 3463185 I Fax OFFICE-FAX +64 3 477 5232
http://www.gns.cri.nz/ I Email: p.scad...@gns.cri.nz




IllegalStateException locking up solr during query.

2011-10-31 Thread Phil Scadden
This seems to happen when I ask for more pages of results: solr 
essentially stops working. Half an hour later it was working okay. Solr 
3.4 on Tomcat 5.5.15.
Logs look like this (example of one of many...):
Any ideas very welcome.

1/11/2011 12:00:14 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet SolrServer threw exception
java.lang.IllegalStateException
 at 
org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:404)
 at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:380)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:283)
 at sun.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
 at 
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
 at 
org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:217)
 at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:197)
 at 
org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:50)
 at 
org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:156)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:152)
 at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
 at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
 at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
 at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:667)
 at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
 at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:619)

-- 
Phil Scadden, Senior Scientist GNS Science Ltd 764 Cumberland St, 
Private Bag 1930, Dunedin, New Zealand Ph +64 3 4799663, fax +64 3 477 5232





Re: question from a beginner

2011-10-30 Thread Phil Scadden
Look up highlighting. http://wiki.apache.org/solr/HighlightingParameters





Timeout trying to index from nutch

2011-08-11 Thread Phil Scadden
I am a new user and I have SOLR installed. I can use the admin page and 
query the example data.
However, I was using nutch to load the index with intranet web pages and I 
got this message.

SolrIndexer: starting at 2011-08-12 16:52:44
org.apache.solr.client.solrj.SolrServerException: 
java.net.ConnectException: Connection timed out

The timeout happened after about 12 minutes. I can't seem to find this 
message in an archive search. Can anyone give me some clues?

Notice: This email and any attachments are confidential. If received in error 
please destroy and immediately notify us. Do not copy or disclose the contents.