RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Code for solrj is going to be very dependent on your needs but the beating heart of my code is below (note that I do OCR as a separate step before feeding files into the indexer). Solrj and tika docs should help. File f = new File(filename); ContentHandler textHandler = new

RE: How do *you* restrict access to Solr?

2020-03-16 Thread Phil Scadden
First off, use basic authentication to at least partially lock it down. Only the application server has access to the password. Second, our IT people thought Solr security insufficient to even remotely consider exposing it to the external web. It lives behind a firewall, so we use a kind of proxy. External

RE: Upgrading tika

2019-03-20 Thread Phil Scadden
do not want to have your Solr instance processing via Tika”? If that’s a bad design choice please elaborate. Thanks, Geoff > On Mar 19, 2019, at 5:15 PM, Phil Scadden wrote: > > As per Erick advice, I would strongly recommend that you do anything tika in > a separate solrj prog

RE: Upgrading tika

2019-03-19 Thread Phil Scadden
As per Erick's advice, I would strongly recommend that you do anything tika in a separate solrj programme. You do not want to have your solr instance processing via tika. -Original Message- From: Tannen, Lev (USAEO) [Contractor] Sent: Wednesday, 20 March 2019 08:17 To:

RE: Solr OCR Support

2018-11-04 Thread Phil Scadden
I would strongly consider OCR offline, BEFORE loading the documents into Solr. The advantage of this is that you convert your OCRed PDF into searchable PDF. Consider someone using Solr and they have found a document that matches their search criteria. Once they retrieve the document, they will

RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
I will second the SolrJ method. You don’t want to be doing this on your SOLR instance. One question is whether your PDFs are scanned or are already searchable. I use tesseract offline to convert all scanned PDFs into searchable PDF so I don’t want Tika to be doing that. My code core is:

RE: Solr Read-Only?

2018-03-07 Thread Phil Scadden
I would also second the proxy approach. Besides keeping your solr instance behind a firewall and not directly exposed, you can do a lot in a proxy: per-user control over which index they can access, filtering of queries, etc. -Original Message- From: Emir Arnautović

RE: Turn on/off query based on a url parameter

2018-02-22 Thread Phil Scadden
I always filter solr requests via a proxy (so solr itself is not exposed directly to the web). In that proxy, the query parameters can be broken down and filtered as desired (I examine authorities granted to a session to control even which indexes are being searched) before passing the modified
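The proxy-side filtering described in this message could be sketched roughly as follows. This is only an illustration, not the author's code: the class name `SolrParamFilter`, the whitelist contents, and the `resolveCore` helper are all assumptions, and the parameter map is shaped like the one a servlet's `getParameterMap()` returns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch of the parameter filtering a proxy might do before forwarding to Solr. */
final class SolrParamFilter {

    // Hypothetical whitelist: only these parameter names are passed through.
    static final Set<String> ALLOWED = new HashSet<>(Arrays.asList(
            "q", "fq", "fl", "start", "rows", "sort", "hl", "hl.fl", "wt"));

    /** Returns a copy of the request parameters containing only whitelisted names. */
    static Map<String, String[]> filter(Map<String, String[]> requestParams) {
        Map<String, String[]> safe = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : requestParams.entrySet()) {
            if (ALLOWED.contains(e.getKey())) {
                safe.put(e.getKey(), e.getValue());
            }
        }
        return safe;
    }

    /** Returns the requested core if the session has been granted it, otherwise null. */
    static String resolveCore(String requestedCore, Set<String> coresGrantedToSession) {
        return coresGrantedToSession.contains(requestedCore) ? requestedCore : null;
    }
}
```

A whitelist is used here rather than a blacklist so that any parameter the proxy does not know about is dropped by default.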

RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
consider how much space between > letters in words (in the body text) should be allowed and still > consider it a single word. I'm not quite sure how to prove that, but > I'd be willing to make a bet ;) > > Erick > > On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@

Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
I am indexing PDFs and a separate process has converted any image PDFs to searchable PDFs before solr gets near them. I notice that tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing tika down), just the text. Has anyone used an alternative PDF text

RE: Multiple cores versus a "source" field.

2017-12-05 Thread Phil Scadden
get some advantage from having more data points about the “text” and “title” fields. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 4, 2017, at 7:17 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote: > > Thanks Eric. I have alrea

RE: Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
false}) is if you are using bare NOW in your clauses for, say ranges, one common construct is fq=date[NOW-1DAY TO NOW]. Here's another blog on the subject: https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/ Best, Erick On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad.

RE: Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
>You'll have a few economies of scale I think with a single core, but frankly I >don't know if they'd be enough to measure. You say the docs are "quite large" >though, >are you talking books? Magazine articles? is 20K large or are the 20M? Technical reports. Sometimes up to 200MB pdfs, but that

Multiple cores versus a "source" field.

2017-12-04 Thread Phil Scadden
I have two different document stores that I want to index. Both are quite small (<50,000 documents though documents can be quite large). They are quite capable of using the same schema, but you would not want to search both simultaneously. I can see two approaches to handling this case. 1/ Create

RE: adding documents to a secured solr server.

2017-11-02 Thread Phil Scadden
Yes, that worked. -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Thursday, 2 November 2017 6:14 p.m. To: solr-user@lucene.apache.org Subject: Re: adding documents to a secured solr server. On 11/1/2017 10:04 PM, Phil Scadden wrote: > For testing, I chan

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
Requested reload and now it indexes with secure server using HttpSolrClient. Phew. I now look to see if I can optimize and get concurrentupdate server to work. At least I can get the index back now. -Original Message- From: Phil Scadden [mailto:p.scad...@gns.cri.nz] Sent: Thursday, 2

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
ave access to the server. This is a frustrating problem. -Original Message- From: Shawn Heisey [mailto:elyog...@elyograg.org] Sent: Thursday, 2 November 2017 3:55 p.m. To: solr-user@lucene.apache.org Subject: Re: adding documents to a secured solr server. On 11/1/2017 8:13 PM, Phil Scadden wrot

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
: adding documents to a secured solr server. On 11/1/2017 8:13 PM, Phil Scadden wrote: > 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner: > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e > eba4a > 14:52:46,224 WARN ConcurrentUpdateSolrClient:

RE: Stateless queries to secured SOLR server.

2017-11-01 Thread Phil Scadden
] Sent: Thursday, 2 November 2017 3:13 p.m. To: solr-user@lucene.apache.org Subject: Re: Stateless queries to secured SOLR server. On 11/1/2017 4:22 PM, Phil Scadden wrote: > Except that I am using solrj in an intermediary proxy and passing the > response directly to a javascript client. It is

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
"name":"read", "role":"guest"}], "user-role":{"solrAdmin":["admin","guest"],"solrGuest":"guest"}}} It looks like I should be able to add. This one worked to delete the entire index: UpdateRequest up = ne

RE: adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
24 DEBUG ConcurrentUpdateSolrClient:210 - finished: org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a Even more puzzling. Authentication is set. What is the invalid version bit?? I think my solrj is 6.4.1; the server is 6.6.2. Do these have to match exactly?? -Original Message- From: Phil Scadden [mailto:p.s

adding documents to a secured solr server.

2017-11-01 Thread Phil Scadden
Solrj QueryRequest object has a method to set basic authorization username/password but what is the equivalent way to pass authorization when you are adding new documents to an index? ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient(solrProperties.getServer(),10,2); ...

RE: Stateless queries to secured SOLR server.

2017-11-01 Thread Phil Scadden
. To: solr-user@lucene.apache.org Subject: Re: Stateless queries to secured SOLR server. On 10/31/2017 2:08 PM, Phil Scadden wrote: > Thanks Shawn. I have done it with SolrJ. Apart from needing the > NoopResponseParser to handle the wt=, it was pretty painless. This is confusing to me, because with

RE: Stateless queries to secured SOLR server.

2017-10-31 Thread Phil Scadden
: Stateless queries to secured SOLR server. On 10/29/2017 6:13 PM, Phil Scadden wrote: > While SOLR is behind a firewall, I want to now move to a secured SOLR > environment. I had been hoping to keep SOLRJ out of the picture and just > using httpURLConnection. However, I also don't want to

Stateless queries to secured SOLR server.

2017-10-29 Thread Phil Scadden
While SOLR is behind a firewall, I want to now move to a secured SOLR environment. I had been hoping to keep SOLRJ out of the picture and just use HttpURLConnection. However, I also don't want to maintain session state, preferring to send authentication with every request. Is this possible
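One way to send credentials with every request and keep no session state is HTTP Basic authentication over a plain `HttpURLConnection`. The sketch below assumes the Solr server is configured for Basic auth; the class and method names are illustrative only and are not taken from the thread.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: attach Basic credentials to each request instead of keeping a session. */
final class StatelessSolrAuth {

    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Prepares (but does not yet send) a request that carries its own credentials. */
    static HttpURLConnection open(String url, String user, String password) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        return conn;
    }
}
```

Because Basic credentials are only Base64-encoded, not encrypted, this is only reasonable on a trusted network segment or over HTTPS.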

solr and machine learning - recommendations?

2017-10-05 Thread Phil Scadden
Now that I have got a big hunk of documents indexed with Solr, I am looking to see whether I can try some machine learning tools to try and extract bibliographic references out of the documents. Anyone got some recommendations about which kits might be good to play with for something like this?

RE: DocValues, Long and SolrJ

2017-09-26 Thread Phil Scadden
ception ex) { } // start the index rebuild -Original Message----- From: Phil Scadden [mailto:p.scad...@gns.cri.nz] Sent: Wednesday, 27 September 2017 10:04 a.m. To: solr-user@lucene.apache.org Subject: RE: DocValues, Long and SolrJ I get it after I have deleted the index with a delete query and st

RE: DocValues, Long and SolrJ

2017-09-26 Thread Phil Scadden
changed after some documents being indexed. Thanks, Emir > On 25 Sep 2017, at 23:42, Phil Scadden <p.scad...@gns.cri.nz> wrote: > > I ran into a problem with indexing documents which I worked around by > changing data type, but I am curious as to how the setup could be made to

DocValues, Long and SolrJ

2017-09-25 Thread Phil Scadden
I ran into a problem with indexing documents which I worked around by changing data type, but I am curious as to how the setup could be made to work. Solr 6.5.1 - Field type Long, multivalued false, DocValues. In indexing with Solr, I set the value of field with: Long

RE: Solr update failing on remote server but works locally??

2017-09-24 Thread Phil Scadden
MERIC for field". Beats me what it expects for values in document.addField(...), but changing the field type from Long to Int fixed it. -Original Message----- From: Phil Scadden [mailto:p.scad...@gns.cri.nz] Sent: Sunday, 24 September 2017 4:35 p.m. To: solr-user@lucene.apache.org Subject: S

Solr update failing on remote server but works locally??

2017-09-23 Thread Phil Scadden
I am attempting to redo an index job. The delete query worked fine but on reindex, I get this: 09:42:51,061 ERROR ConcurrentUpdateSolrClient:463 - error org.apache.solr.common.SolrException: Bad Request request: http://online-uat:8983/solr/prindex/update?wt=javabin&version=2 at

RE: write.lock file appears and solr wont open

2017-09-04 Thread Phil Scadden
straws here mind you. Best, Erick On Thu, Aug 24, 2017 at 9:02 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote: > SOLR_HOME is /var/www/solr/data > The zip was actually the entire data directory which also included > configsets. And yes core.properties is in var/www/solr/data/prindex (jus

RE: query with wild card with AND taking lot of time

2017-09-03 Thread Phil Scadden
5 seems a reasonable limit to me. After that revert to slow. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Saturday, 2 September 2017 12:01 p.m. To: solr-user Subject: Re: query with wild card with AND taking lot of time How

RE: write.lock file appears and solr wont open

2017-08-24 Thread Phil Scadden
SOLR_HOME is /var/www/solr/data The zip was actually the entire data directory which also included configsets. And yes core.properties is in var/www/solr/data/prindex (just has single line name=prindex, in it). No other cores are present. The data directory should have been unzipped before the

write.lock file appears and solr wont open

2017-08-24 Thread Phil Scadden
I am slowly moving 6.5.1 from development to production. After installing solr on the final test machine, I tried to supply a core by zipping up the data directory on development and unzipping on test. When I go to admin I get: [inline image] Write.lock obviously causing a

RE: Optimizing Dataimport from Oracle; cursor sharing; changing oracle session parameters

2017-08-15 Thread Phil Scadden
Perhaps there is potential to optimize with some PL/SQL functions on the Oracle side to do as much work within the database as possible and have the text indexers only access a view referencing that function. Also, the obvious optimization is a record-updated timestamp so that every time indexer runs,
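The record-updated timestamp idea can be sketched as follows. Everything named here is hypothetical: the view `index_view`, the column `updated_at`, and the `Row` class stand in for whatever the real schema provides. The in-memory filter mirrors the SQL `WHERE` clause so the logic can be checked without a database.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Sketch: only re-index rows modified since the previous indexer run. */
final class IncrementalIndexing {

    /** Minimal stand-in for a database row; both fields are hypothetical. */
    static final class Row {
        final String id;
        final Instant updatedAt;

        Row(String id, Instant updatedAt) {
            this.id = id;
            this.updatedAt = updatedAt;
        }
    }

    /** Query an indexer might run; the view and column names are assumptions. */
    static final String CHANGED_ROWS_SQL =
            "SELECT id, body FROM index_view WHERE updated_at > ?";

    /** In-memory equivalent of the WHERE clause above. */
    static List<Row> changedSince(List<Row> rows, Instant lastRun) {
        List<Row> changed = new ArrayList<>();
        for (Row row : rows) {
            if (row.updatedAt.isAfter(lastRun)) {
                changed.add(row);
            }
        }
        return changed;
    }
}
```

The indexer would record the start time of each run and pass it as `lastRun` on the next one.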

RE: Storing data in Solr

2017-08-08 Thread Phil Scadden
When I am putting PDF documents and rows from a table into the same index, I create a "dataSource" field to identify the source and I don't copy database fields - only index them - apart from the unique key which is stored as "document". On search, you process the output before passing it to the user.

RE: Arabic words search in solr

2017-08-02 Thread Phil Scadden
:58 a.m. To: solr-user@lucene.apache.org Subject: RE: Arabic words search in solr Hi Phil Scadden, Thank you for your reply, we tried your suggested solution by removing hyphen while indexing, but it was getting wrong results. i was searching for "شرطة ازكي" and it was showing me

RE: Arabic words search in solr

2017-07-31 Thread Phil Scadden
Further to that. What results do you get when you put those indexed terms into the Analysis tool on the Solr UI? -Original Message- From: Phil Scadden [mailto:p.scad...@gns.cri.nz] Sent: Tuesday, 1 August 2017 9:06 a.m. To: solr-user@lucene.apache.org Subject: RE: Arabic words search

RE: Arabic words search in solr

2017-07-31 Thread Phil Scadden
Am I correct in assuming that you have the problem searching only when there is a hyphen in your indexed text? If so, then it would suggest that you need to use a different tokenizer when indexing - it looks like the hyphen is removed and words each side are concatenated - hence need both

RE: Issues trying to boost phrase containing stop word

2017-07-20 Thread Phil Scadden
The simplest suggestion is to get rid of the stop word filter. I've seen people here comment that it is not worth it for the amount of space it saves. -Original Message- From: shamik [mailto:sham...@gmail.com] Sent: Friday, 21 July 2017 9:49 a.m. To: solr-user@lucene.apache.org Subject: Re:

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Phil Scadden
http - however, the big advantage of doing your indexing on different machine is that the heavy lifting that tika does in extracting text from documents, finding metadata etc is not happening on the server. If the indexer crashes, it doesn’t affect Solr either. -Original Message-

RE: CSV output

2017-06-15 Thread Phil Scadden
the output?What do you get going directly to Solr's endpoint? Erik > On Jun 14, 2017, at 22:13, Phil Scadden <p.scad...@gns.cri.nz> wrote: > > If I try > /getsolr? > fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv > > The response I get is: &

RE: Issue with highlighter

2017-06-14 Thread Phil Scadden
Just had a similar issue - works for some, not others. First thing to look at is hl.maxAnalyzedChars in the query. The default is quite small. Since many of my documents are large PDF files, I opted to use storeOffsetsWithPositions="true" termVectors="true" on the field I was searching on. This
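As a rough illustration of raising `hl.maxAnalyzedChars` on a request, the sketch below builds a query string with the highlighting parameters set explicitly. The helper name `build`, the `_text_` field choice, and the limit used in the test are assumptions for illustration, not values from the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: build a Solr query string that raises the highlighter's analysis limit. */
final class HighlightQuery {

    static String build(String query, int maxAnalyzedChars) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("hl", "true");
        params.put("hl.fl", "_text_"); // field to highlight; an assumption
        params.put("hl.maxAnalyzedChars", Integer.toString(maxAnalyzedChars));

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(e.getKey()).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(ex);
        }
    }
}
```

If matches that occur late in a large document come back with empty highlight arrays, the analysis limit is one plausible cause, which is why it is worth setting explicitly while diagnosing.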

CSV output

2017-06-14 Thread Phil Scadden
If I try /getsolr? fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv The response I get is: id,title,datasource,scoreW:\PR_Reports\OCR\PR869.pdf,,Petroleum Reports,8.233313W:\PR_Reports\OCR\PR3440.pdf,,Petroleum Reports,8.217836W:\PR_Reports\OCR\PR4313.pdf,,Petroleum

RE: Highlighter not working on some documents

2017-06-12 Thread Phil Scadden
, 2017 at 9:58 PM Phil Scadden <p.scad...@gns.cri.nz> wrote: > Tried hard to find difference between pdfs returning no highlighter > and ones that do for same search term. Includes pdfs that have been > OCRed and ones that were text to begin with. Head scratching to me. > >

RE: including a minus sign "-" in the token

2017-06-11 Thread Phil Scadden
. To: Phil Scadden <p.scad...@gns.cri.nz> Subject: Re: including a minus sign "-" in the token On 6/9/2017 8:12 PM, Phil Scadden wrote: > So, the field I am using for search has type of: >positionIncrementGap="100" multiValued="true"&

RE: including a minus sign "-" in the token

2017-06-09 Thread Phil Scadden
n Heisey [mailto:apa...@elyograg.org] Sent: Saturday, 10 June 2017 12:43 a.m. To: solr-user@lucene.apache.org Subject: Re: including a minus sign "-" in the token On 6/8/2017 8:39 PM, Phil Scadden wrote: > We have important entities referenced in indexed documents which have > convention

RE: Highlighter not working on some documents

2017-06-09 Thread Phil Scadden
8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote: > Do a search with: > fl=id,title,datasource=true=unified=50=1=pre > ssure+AND+testing=50=0=json > > and I get back a good list of documents. However, some documents are > returning empty fields in the high

RE: Highlighter not working on some documents

2017-06-09 Thread Phil Scadden
h-all field. And you don't want to store that information anyway since it's usually the destination of copyField directives and you'd highlight _those_ fields. Best, Erick On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote: > Do a search with: > fl=id,title,da

Highlighter not working on some documents

2017-06-08 Thread Phil Scadden
Do a search with: fl=id,title,datasource=true=unified=50=1=pressure+AND+testing=50=0=json and I get back a good list of documents. However, some documents are returning empty fields in the highlighter. Eg, in the highlight array have: "W:\\Reports\\OCR\\4272.pdf":{"_text_":[]} Getting this well

including a minus sign "-" in the token

2017-06-08 Thread Phil Scadden
We have important entities referenced in indexed documents which have a conventional naming of geographicname-number. Eg Wainui-8. I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I want a q of Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with
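On the query side, escaping can be done programmatically rather than by hand. SolrJ provides `ClientUtils.escapeQueryChars` for this; the stand-alone sketch below imitates that idea with a character list chosen for illustration, so treat it as an approximation and not the library's exact behaviour.

```java
/** Sketch: backslash-escape characters that the query parser treats specially. */
final class QueryEscaper {

    // Characters escaped here; the list is an assumption modelled on Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Escaping only affects how the query parser reads the term. Whether `Wainui-8` survives as a single token still depends on the tokenizer configured for the field at index and query time.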

RE: Got a 404 trying to update a solr. 6.5.1 server. /solr/update not found.

2017-06-06 Thread Phil Scadden
name in the path. Tomás Sent from my iPhone > On Jun 5, 2017, at 9:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote: > > Simple piece of code. Had been working earlier (though against a 6.4.2 > instance). > > ConcurrentUpdateSolrClient solr = new > ConcurrentUpda

Got a 404 trying to update a solr. 6.5.1 server. /solr/update not found.

2017-06-05 Thread Phil Scadden
Simple piece of code. Had been working earlier (though against a 6.4.2 instance). ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient("http://myhost:8983/solr",10,2); try { solr.deleteByQuery("*:*"); solr.commit(); } catch
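A 404 on `/solr/update` is consistent with a client URL that stops at `/solr` and omits the core name, which is what the reply in this thread points at. The sketch below shows the URL shape only; the helper names are illustrative and `prindex` is just a placeholder core name.

```java
/** Sketch: include the core name when building the URL handed to the client. */
final class SolrUrls {

    /** Joins a Solr base URL and a core name, tolerating a trailing slash on the base. */
    static String coreUrl(String baseUrl, String core) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return base + "/" + core;
    }

    /** The update endpoint for a core. */
    static String updateUrl(String baseUrl, String core) {
        return coreUrl(baseUrl, core) + "/update";
    }
}
```

With a base of `http://myhost:8983/solr` and the placeholder core, the client would be given `http://myhost:8983/solr/prindex`, not the bare `/solr` URL.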

RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Phil Scadden
Yes, that would seem an accurate assessment of the problem. -Original Message- From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] Sent: Thursday, 30 March 2017 4:53 p.m. To: solr-user@lucene.apache.org Subject: Re: Indexing speed reduced significantly with OCR Thanks for your reply.

RE: Indexing speed reduced significantly with OCR

2017-03-28 Thread Phil Scadden
Well I haven’t had to deal with a problem that size, but it seems to me that you have little alternative except to throw more computer hardware at it. For the job I did, I OCRed to convert PDF to searchable PDF outside the indexing workflow. I used pdftotext utility to extract text from pdf. If
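Calling `pdftotext` from an indexing program might look roughly like the sketch below. It assumes the `pdftotext` binary (from poppler-utils or xpdf) is on the PATH; the class and method names are illustrative and this is not the author's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Sketch: extract text from an already-searchable PDF with the pdftotext utility. */
final class PdfToText {

    /** Command line: UTF-8 output, written to stdout ("-") so no temp file is needed. */
    static List<String> command(Path pdf) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text; requires the binary on the PATH. */
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdf))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Running the extraction in a separate process also means a malformed PDF can only crash that process, not the indexer or Solr itself.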

RE: Indexing speed reduced significantly with OCR

2017-03-27 Thread Phil Scadden
Only by 10? You must have quite small documents. OCR is an extremely expensive process. Indexing is trivial by comparison. For the quite large documents I am working with, OCR can be 100 times slower than indexing a PDF that is searchable (text extractable without OCR). -Original Message-

RE: Index scanned documents

2017-03-26 Thread Phil Scadden
While building directly into Solr might be appealing, I would argue that it is best to use OCR software first, outside of SOLR, to convert the PDF into "searchable" PDF format. That way when the document is retrieved, it is a lot more useful to the searcher - making it easy to find the text

Finding time of last commit to index from SolrJ?

2017-03-15 Thread Phil Scadden
The admin gui displays the time of last commit to a core but how can this be queried from within SolrJ? Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences

getting "dedupe" management when updating index via tika and SolrJ

2017-03-15 Thread Phil Scadden
I have added a signature field to schema and setup dedupe handler in solrconfig.xml as per docs, however docs say: “Be sure to change your update handlers to use the defined chain, as below:” Umm, WHERE do you change the update handler to use the defined chain? Is this in one of config xmls or

RE: https

2017-03-08 Thread Phil Scadden
What we are suggesting is that your browser does NOT access solr directly at all. In fact, configure firewall so that SOLR is unreachable outside the server. Instead you write a proxy in your site application which calls SOLR instead. Ie a server-to-server call instead of browser-to-server.

RE: https

2017-03-07 Thread Phil Scadden
>The first advise is NOT to expose your Solr directly to the public. >Anyone that can hit /search, can also hit /update and wipe out your index. I would second that too. We have never exposed Solr and I also sanitise queries in the proxy.

RE: Managed schema vs schema.xml

2017-03-07 Thread Phil Scadden
I would second that the guide could be clearer on that. I read and reread several times trying to get my head around the schema.xml/managed-schema bit. I came away from first cursory reading with the idea that managed-schema was mostly for schema-less mode and only after some stuff ups and puzzling

Recommendation for production SOLR

2017-03-06 Thread Phil Scadden
Given the known issues with 6.4.1 and no release date for 6.4.2, is the best recommendation for a production version of SOLR 6.3.0? Hoping to take to production in first week of April.

RE: Excessive Wire logging while indexing.

2017-03-02 Thread Phil Scadden
Got it all working with Tika and SolrJ. (Got the correct artifacts). Much faster now too which is good. Thanks very much for your help.

RE: Excessive Wire logging while indexing. Blank output from tika parser

2017-03-01 Thread Phil Scadden
Belay that. I found out why parser was just returning empty data - I didn’t have the right artefact in maven. In case anyone else trips on this: groupId org.apache.tika, artifactId tika-core, version 1.12; groupId org.apache.tika, artifactId tika-parsers

RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
>Another side issue: Using the extracting handler for handling rich documents >is discouraged. Tika (which is what is used by the extracting >handler) is pretty amazing software, but it has a habit of crashing or >consuming all the heap memory when it encounters a document that it doesn't

RE: Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
The logging is coming from the application which is running in Tomcat. Solr itself is running in the embedded Jetty. And yes, another look at the log4j and I see that rootlogger is set to DEBUG. I've changed that. >On the Solr server side, the 6.4.x versions have a bug that causes extremely >high

Excessive Wire logging while indexing.

2017-03-01 Thread Phil Scadden
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); solr.request(up); All the logging generated by last line. I don’t have any httpclient.wire lines in my log4j.properties (presume these are from httpclient.wire). What do I do to turn this off? Phil Scadden,

IllegalStateException locking up solr during query.

2011-10-31 Thread Phil Scadden
) at java.lang.Thread.run(Thread.java:619) -- Phil Scadden, Senior Scientist GNS Science Ltd 764 Cumberland St, Private Bag 1930, Dunedin, New Zealand Ph +64 3 4799663, fax +64 3 477 5232 Notice: This email and any attachments are confidential. If received in error please destroy and immediately notify us. Do

Re: question from a beginner

2011-10-30 Thread Phil Scadden
Look up highlighting. http://wiki.apache.org/solr/HighlightingParameters

Timeout trying to index from nutch

2011-08-11 Thread Phil Scadden
I am new user and I have SOLR installed. I can use the admin page and query the example data. However, I was using nutch to load index with intranet web pages and I got this message. SolrIndexer: starting at 2011-08-12 16:52:44 org.apache.solr.client.solrj.SolrServerException: