Code for SolrJ is going to be very dependent on your needs, but the beating
heart of my code is below (note that I do OCR as a separate step before feeding
files into the indexer). The SolrJ and Tika docs should help.
File f = new File(filename);
ContentHandler textHandler = new
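The snippet is cut off above, and the poster's actual code is not in the archive. Purely as a sketch of what such an indexer typically looks like with Tika's AutoDetectParser/BodyContentHandler and SolrJ's HttpSolrClient (the collection name, field names, and server URL here are assumptions, not from the thread):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class Indexer {
    public static void indexFile(String filename) throws Exception {
        File f = new File(filename);
        // -1 disables BodyContentHandler's default 100K character limit
        BodyContentHandler textHandler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream input = new FileInputStream(f)) {
            parser.parse(input, textHandler, metadata, new ParseContext());
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", f.getCanonicalPath());      // field names assumed
        doc.addField("title", metadata.get("title"));
        doc.addField("_text_", textHandler.toString());
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/prindex").build()) {  // URL assumed
            solr.add(doc);
            solr.commit();
        }
    }
}
```

This keeps all the Tika heavy lifting in the standalone program, as advised elsewhere in the thread.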
First off, use basic authentication to at least partially lock it down. Only
the application server has access to the password. Second, our IT people
thought Solr security insufficient to even remotely consider exposing it to the
external web. It lives behind a firewall, so we do a kind of proxy. External
“…do not want to have your Solr instance processing via Tika”? If that’s a bad
design choice, please elaborate.
Thanks,
Geoff
> On Mar 19, 2019, at 5:15 PM, Phil Scadden wrote:
>
> As per Erick advice, I would strongly recommend that you do anything tika in
> a separate solrj prog
As per Erick's advice, I would strongly recommend that you do anything
Tika-related in a separate SolrJ program. You do not want your Solr instance
processing via Tika.
-Original Message-
From: Tannen, Lev (USAEO) [Contractor]
Sent: Wednesday, 20 March 2019 08:17
To:
I would strongly consider doing OCR offline, BEFORE loading the documents into
Solr. The advantage of this is that you convert your OCRed PDF into a
searchable PDF. Consider someone using Solr who has found a document that
matches their search criteria. Once they retrieve the document, they will
I will second the SolrJ method. You don’t want to be doing this on your Solr
instance. One question is whether your PDFs are scanned or already searchable.
I use Tesseract offline to convert all scanned PDFs into searchable PDFs, so I
don’t want Tika to be doing that. My code core is:
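The code itself was cut off by the archive. Just as an illustration of the offline OCR step (this is not the poster's code; it assumes the tesseract CLI is on the PATH and that the scanned page has already been rendered to an image such as TIFF), one might shell out like this:

```java
import java.util.Arrays;
import java.util.List;

public class OcrStep {
    // Build the Tesseract command: `tesseract input.tif outbase pdf`
    // writes a searchable outbase.pdf (assumes tesseract is on the PATH).
    static List<String> tesseractCommand(String imageFile, String outBase) {
        return Arrays.asList("tesseract", imageFile, outBase, "pdf");
    }

    static void ocrToSearchablePdf(String imageFile, String outBase)
            throws java.io.IOException, InterruptedException {
        Process p = new ProcessBuilder(tesseractCommand(imageFile, outBase))
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new java.io.IOException("tesseract failed for " + imageFile);
        }
    }
}
```

The resulting searchable PDF can then be fed to the indexer with no OCR load on Solr.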
I would also second the proxy approach. Besides keeping your Solr instance
behind a firewall and not directly exposed, you can do a lot in a proxy:
per-user control over which indexes they can access, filtering of queries, etc.
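The query-filtering part of such a proxy can be sketched as a simple parameter whitelist (the whitelist contents below are just an example; a real proxy would also URL-decode and re-validate values):

```java
import java.util.Set;
import java.util.StringJoiner;

public class QuerySanitizer {
    // Keep only whitelisted Solr parameters from a raw query string,
    // dropping anything the client must not control (qt, shards, stream.*, ...).
    static String sanitize(String rawQuery, Set<String> allowed) {
        StringJoiner out = new StringJoiner("&");
        for (String pair : rawQuery.split("&")) {
            String name = pair.split("=", 2)[0];
            if (allowed.contains(name)) {
                out.add(pair);
            }
        }
        return out.toString();
    }
}
```

For example, sanitize("q=wainui&qt=%2Fupdate&rows=10", allowed) with allowed = {q, rows, fl} drops the qt parameter and passes only q and rows through to Solr.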
-Original Message-
From: Emir Arnautović
I always filter Solr requests via a proxy (so Solr itself is not exposed
directly to the web). In that proxy, the query parameters can be broken down
and filtered as desired (I examine the authorities granted to a session to
control even which indexes are searched) before passing the modified
consider how much space between
> letters in words (in the body text) should be allowed and still
> consider it a single word. I'm not quite sure how to prove that, but
> I'd be willing to make a bet ;)
>
> Erick
>
> On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden <p.scad...@
I am indexing PDFs, and a separate process has converted any image PDFs to
searchable PDFs before Solr gets near them. I notice that Tika is very slow at
parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika
down), just the text. Has anyone used an alternative PDF text
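One alternative worth trying (my suggestion, not an answer from the thread) is to call PDFBox's text stripper directly and skip Tika's type detection and metadata machinery altogether; a sketch against the PDFBox 2.x API:

```java
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextOnly {
    // Extract plain text only; no metadata parsing.
    static String extractText(String filename) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(filename))) {
            return new PDFTextStripper().getText(doc);
        }
    }
}
```

Tika's PDF support is itself built on PDFBox, so this mainly saves the detection/metadata overhead rather than the core parsing cost.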
get
some advantage from having more data points about the “text” and “title” fields.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Dec 4, 2017, at 7:17 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> Thanks Eric. I have alrea
false}) is if you are using bare NOW in your clauses for, say, ranges;
one common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the
subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
Best,
Erick
On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden <p.scad.
>You'll have a few economies of scale I think with a single core, but frankly I
>don't know if they'd be enough to measure. You say the docs are "quite large"
>though - are you talking books? Magazine articles? Is 20K large, or are they 20M?
Technical reports. Sometimes up to 200MB pdfs, but that
I have two different document stores that I want to index. Both are quite small
(<50,000 documents, though documents can be quite large). They are quite capable
of using the same schema, but you would not want to search both simultaneously.
I can see two approaches to handling this case.
1/ Create
Yes, that worked.
-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, 2 November 2017 6:14 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.
On 11/1/2017 10:04 PM, Phil Scadden wrote:
> For testing, I chan
Requested a reload and now it indexes with the secured server using
HttpSolrClient. Phew. I will now look at whether I can optimize and get
ConcurrentUpdateSolrClient to work.
At least I can get the index back now.
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Thursday, 2
ave access to the server.
This is a frustrating problem.
-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 2 November 2017 3:55 p.m.
To: solr-user@lucene.apache.org
Subject: Re: adding documents to a secured solr server.
On 11/1/2017 8:13 PM, Phil Scadden wrot
: adding documents to a secured solr server.
On 11/1/2017 8:13 PM, Phil Scadden wrote:
> 14:52:45,962 DEBUG ConcurrentUpdateSolrClient:177 - starting runner:
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6e
> eba4a
> 14:52:46,224 WARN ConcurrentUpdateSolrClient:
]
Sent: Thursday, 2 November 2017 3:13 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.
On 11/1/2017 4:22 PM, Phil Scadden wrote:
> Except that I am using solrj in an intermediary proxy and passing the
> response directly to a javascript client. It is
"name":"read",
"role":"guest"}],
"user-role":{"solrAdmin":["admin","guest"],"solrGuest":"guest"}}}
It looks like I should be able to add.
This one worked to delete the entire index:
UpdateRequest up = ne
24 DEBUG ConcurrentUpdateSolrClient:210 - finished:
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner@6eeba4a
Even more puzzling. Authentication is set. What is the invalid version bit? I
think my SolrJ is 6.4.1; the server is 6.6.2. Do these have to match exactly?
-Original Message-
From: Phil Scadden [mailto:p.s
Solrj QueryRequest object has a method to set basic authorization
username/password but what is the equivalent way to pass authorization when you
are adding new documents to an index?
ConcurrentUpdateSolrClient solr =
    new ConcurrentUpdateSolrClient(solrProperties.getServer(), 10, 2);
...
.
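For what it's worth, one way that should work in SolrJ 6.x is to route the add through an UpdateRequest, which (like QueryRequest) inherits setBasicAuthCredentials from SolrRequest. The credentials and URL below are placeholders:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class SecuredAdd {
    static void addDoc(SolrInputDocument doc) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/prindex").build()) {  // URL assumed
            UpdateRequest up = new UpdateRequest();
            up.add(doc);
            // commit as part of the same (authenticated) request, rather than
            // a separate solr.commit() call that would carry no credentials
            up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            up.setBasicAuthCredentials("solrAdmin", "password"); // placeholders
            up.process(solr);
        }
    }
}
```

Note that ConcurrentUpdateSolrClient batches documents across requests internally, which is why per-request credentials are awkward there; a plain HttpSolrClient with explicit UpdateRequests sidesteps that.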
To: solr-user@lucene.apache.org
Subject: Re: Stateless queries to secured SOLR server.
On 10/31/2017 2:08 PM, Phil Scadden wrote:
> Thanks Shawn. I have done it with SolrJ. Apart from needing the
> NoopResponseParser to handle the wt=, it was pretty painless.
This is confusing to me, because with
: Stateless queries to secured SOLR server.
On 10/29/2017 6:13 PM, Phil Scadden wrote:
> While SOLR is behind a firewall, I want to now move to a secured SOLR
> environment. I had been hoping to keep SOLRJ out of the picture and just
> using httpURLConnection. However, I also don't want to
While Solr is behind a firewall, I now want to move to a secured Solr
environment. I had been hoping to keep SolrJ out of the picture and just use
httpURLConnection. However, I also don't want to maintain session state,
preferring to send authentication with every request. Is this possible
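Sending the credentials preemptively with every request keeps things stateless. A sketch with plain HttpURLConnection (URL and credentials are placeholders):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StatelessQuery {
    // Build the preemptive Basic auth header value for user:password.
    static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder().encodeToString(
                (user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    static HttpURLConnection openSolrConnection(String url, String user,
            String password) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // attach credentials to this request; no session/cookie state kept
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        return conn;
    }
}
```

Each request carries its own Authorization header, so no login round-trip or session is needed.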
Now that I have a big hunk of documents indexed with Solr, I am looking to see
whether I can try some machine learning tools to extract bibliographic
references from the documents. Has anyone got recommendations about which kits
might be good to play with for something like this?
ception ex) {
}
// start the index rebuild
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Wednesday, 27 September 2017 10:04 a.m.
To: solr-user@lucene.apache.org
Subject: RE: DocValues, Long and SolrJ
I get it after I have deleted the index with a delete query and st
changed after some
documents being indexed.
Thanks,
Emir
> On 25 Sep 2017, at 23:42, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> I ran into a problem with indexing documents which I worked around by
> changing data type, but I am curious as to how the setup could be made to
I ran into a problem with indexing documents which I worked around by changing
data type, but I am curious as to how the setup could be made to work.
Solr 6.5.1 - Field type Long, multivalued false, DocValues.
In indexing with Solr, I set the value of field with:
Long
MERIC for field". Beats me
what it expects for values in document.addField(...), but changing the field
type from Long to Int fixed it.
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Sunday, 24 September 2017 4:35 p.m.
To: solr-user@lucene.apache.org
Subject: S
I am attempting to redo an index job. The delete query worked fine, but on
reindex I get this:
09:42:51,061 ERROR ConcurrentUpdateSolrClient:463 - error
org.apache.solr.common.SolrException: Bad Request
request: http://online-uat:8983/solr/prindex/update?wt=javabin&version=2
at
straws here
mind you.
Best,
Erick
On Thu, Aug 24, 2017 at 9:02 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> SOLR_HOME is /var/www/solr/data
> The zip was actually the entire data directory which also included
> configsets. And yes core.properties is in var/www/solr/data/prindex (jus
5 seems a reasonable limit to me. After that revert to slow.
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 2 September 2017 12:01 p.m.
To: solr-user
Subject: Re: query with wild card with AND taking lot of time
How
SOLR_HOME is /var/www/solr/data
The zip was actually the entire data directory which also included configsets.
And yes core.properties is in var/www/solr/data/prindex (just has single line
name=prindex, in it). No other cores are present.
The data directory should have been unzipped before the
I am slowly moving 6.5.1 from development to production. After installing Solr
on the final test machine, I tried to supply a core by zipping up the data
directory on development and unzipping it on test.
When I go to admin I get:
[admin UI error screenshot omitted]
Write.lock obviously causing a
Perhaps there is potential to optimize with some PL/SQL functions on the Oracle
side, to do as much work within the database as possible, and have the text
indexers access only a view referencing that function. Also, the obvious
optimization is a record-updated timestamp, so that every time the indexer runs,
When I am putting PDF documents and rows from a table into the same index, I
create a "dataSource" field to identify the source, and I don't copy database
fields (only index them), apart from the unique key, which is stored as
"document". On search, you process the output before passing it to the user.
:58 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search in solr
Hi Phil Scadden,
Thank you for your reply,
we tried your suggested solution by removing the hyphen while indexing, but it
was getting wrong results. I was searching for "شرطة ازكي" and it was showing me
Further to that. What results do you get when you put those indexed terms into
the Analysis tool on the Solr UI?
-Original Message-
From: Phil Scadden [mailto:p.scad...@gns.cri.nz]
Sent: Tuesday, 1 August 2017 9:06 a.m.
To: solr-user@lucene.apache.org
Subject: RE: Arabic words search
Am I correct in assuming that you have the problem searching only when there is
a hyphen in your indexed text? If so, it would suggest that you need to use a
different tokenizer when indexing - it looks like the hyphen is removed and the
words on each side are concatenated - hence the need for both
The simplest suggestion is get rid of the stop word filter. I've seen people
here comment that it is not worth it for the amount of space it saves.
-Original Message-
From: shamik [mailto:sham...@gmail.com]
Sent: Friday, 21 July 2017 9:49 a.m.
To: solr-user@lucene.apache.org
Subject: Re:
http - however, the big advantage of doing your indexing on a different machine
is that the heavy lifting Tika does in extracting text from documents, finding
metadata, etc. is not happening on the server. If the indexer crashes, it
doesn’t affect Solr either.
-Original Message-
the output? What do you get going directly to
Solr's endpoint?
Erik
> On Jun 14, 2017, at 22:13, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> If I try
> /getsolr?
> fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv
>
> The response I get is:
&
Just had a similar issue - it works for some, not others. The first thing to
look at is hl.maxAnalyzedChars in the query. The default is quite small.
Since many of my documents are large PDF files, I opted to use
storeOffsetsWithPositions="true" termVectors="true" on the field I was
searching on.
This
If I try
/getsolr?
fl=id,title,datasource,score=true=9000=unified=Wainui-1=AND=csv
The response I get is:
id,title,datasource,score
W:\PR_Reports\OCR\PR869.pdf,,Petroleum Reports,8.233313
W:\PR_Reports\OCR\PR3440.pdf,,Petroleum Reports,8.217836
W:\PR_Reports\OCR\PR4313.pdf,,Petroleum
, 2017 at 9:58 PM Phil Scadden <p.scad...@gns.cri.nz> wrote:
> Tried hard to find difference between pdfs returning no highlighter
> and ones that do for same search term. Includes pdfs that have been
> OCRed and ones that were text to begin with. Head scratching to me.
>
>
.
To: Phil Scadden <p.scad...@gns.cri.nz>
Subject: Re: including a minus sign "-" in the token
On 6/9/2017 8:12 PM, Phil Scadden wrote:
> So, the field I am using for search has type of:
>positionIncrementGap="100" multiValued="true"&
n Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token
On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which have
> convention
8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> Do a search with:
> fl=id,title,datasource=true=unified=50=1=pressure+AND+testing=50=0=json
>
> and I get back a good list of documents. However, some documents are
> returning empty fields in the high
catch-all field.
And you don't want to store that information anyway, since it's usually the
destination of copyField directives and you'd highlight _those_ fields.
Best,
Erick
On Thu, Jun 8, 2017 at 8:37 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
> Do a search with:
> fl=id,title,da
Do a search with:
fl=id,title,datasource=true=unified=50=1=pressure+AND+testing=50=0=json
and I get back a good list of documents. However, some documents are returning
empty fields in the highlighter. E.g., in the highlight array I have:
"W:\\Reports\\OCR\\4272.pdf":{"_text_":[]}
Getting this well
We have important entities referenced in indexed documents which follow a
naming convention of geographicname-number, e.g. Wainui-8.
I want the tokenizer to treat it as Wainui-8 when indexing, and when I search I
want a q of Wainui-8 (must it be specified as Wainui\-8?) to return docs
with
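As a sketch of the kind of analysis that keeps Wainui-8 as a single token (the type name is made up; WordDelimiterGraphFilterFactory's preserveOriginal keeps the hyphenated original alongside the split parts, and it is available from Solr 6.4):

```xml
<fieldType name="text_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- graph filters should be flattened at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
</fieldType>
```

With preserveOriginal, both "wainui-8" and its parts are indexed, so either form of the query can match; checking the behaviour in the admin Analysis screen is the quickest way to confirm.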
name in the path.
Tomás
Sent from my iPhone
> On Jun 5, 2017, at 9:08 PM, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
> Simple piece of code. Had been working earlier (though against a 6.4.2
> instance).
>
> ConcurrentUpdateSolrClient solr = new
> ConcurrentUpda
Simple piece of code. Had been working earlier (though against a 6.4.2
instance).
ConcurrentUpdateSolrClient solr =
    new ConcurrentUpdateSolrClient("http://myhost:8983/solr", 10, 2);
try {
    solr.deleteByQuery("*:*");
    solr.commit();
} catch
Yes, that would seem an accurate assessment of the problem.
-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Thursday, 30 March 2017 4:53 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR
Thanks for your reply.
Well, I haven’t had to deal with a problem that size, but it seems to me that
you have little alternative except to throw more computer hardware at it. For
the job I did, I OCRed to convert PDFs to searchable PDFs outside the indexing
workflow. I used the pdftotext utility to extract text from PDFs. If
Only by 10? You must have quite small documents. OCR is an extremely expensive
process; indexing is trivial by comparison. For the quite large documents I am
working with, OCR can be 100 times slower than indexing a PDF that is already
searchable (text extractable without OCR).
-Original Message-
While building OCR directly into Solr might be appealing, I would argue that it
is best to use OCR software first, outside of Solr, to convert the PDF into
"searchable" PDF format. That way, when the document is retrieved, it is a lot
more useful to the searcher - making it easy to find the text
The admin gui displays the time of last commit to a core but how can this be
queried from within SolrJ?
Notice: This email and any attachments are confidential and may not be used,
published or redistributed without the prior written consent of the Institute
of Geological and Nuclear Sciences
I have added a signature field to the schema and set up the dedupe handler in
solrconfig.xml as per the docs; however, the docs say:
“Be sure to change your update handlers to use the defined chain, as below:”
Umm, WHERE do you change the update handler to use the defined chain? Is this
in one of the config XMLs or
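For reference, it goes in solrconfig.xml on the update handler itself, via an update.chain default (the chain name "dedupe" below is assumed to match whatever you named your updateRequestProcessorChain):

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
```

Clients can also pass update.chain=dedupe as a request parameter per update instead of making it the default.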
What we are suggesting is that your browser does NOT access Solr directly at
all. In fact, configure the firewall so that Solr is unreachable from outside
the server. Instead, you write a proxy in your site application which calls
Solr. I.e. a server-to-server call instead of a browser-to-server one.
>The first advise is NOT to expose your Solr directly to the public.
>Anyone that can hit /search, can also hit /update and wipe out your index.
I would second that too. We have never exposed Solr and I also sanitise queries
in the proxy.
I would second that the guide could be clearer on that. I read and reread it
several times trying to get my head around the schema.xml/managed-schema bit. I
came away from a first cursory reading with the idea that managed-schema was
mostly for schemaless mode, and only after some stuff-ups and puzzling
Given the known issues with 6.4.1 and no release date for 6.4.2, is the best
recommendation for a production version of Solr 6.3.0? Hoping to take it to
production in the first week of April.
Got it all working with Tika and SolrJ (got the correct artifacts). Much faster
now too, which is good. Thanks very much for your help.
Belay that. I found out why parser was just returning empty data - I didn’t
have the right artefact in maven. In case anyone else trips on this:
org.apache.tika:tika-core:1.12
org.apache.tika:tika-parsers
>Another side issue: Using the extracting handler for handling rich documents
>is discouraged. Tika (which is what is used by the extracting
>handler) is pretty amazing software, but it has a habit of crashing or
>consuming all the heap memory when it encounters a document that it doesn't
The logging is coming from the application, which is running in Tomcat. Solr
itself is running in the embedded Jetty.
And yes, on another look at log4j I see that the rootLogger is set to DEBUG.
I've changed that.
>On the Solr server side, the 6.4.x versions have a bug that causes extremely
>high
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solr.request(up);
All the logging is generated by the last line. I don’t have any httpclient.wire
lines in my log4j.properties (I presume these messages come from httpclient
wire logging). What do I do to turn this off?
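For reference, the usual fix is to pin those noisy categories explicitly in log4j.properties (the category names below assume Commons HttpClient 3.x / Apache HttpClient 4.x wire logging; the appender name is a placeholder):

```properties
log4j.rootLogger=INFO, file
# silence HttpClient wire/header logging regardless of rootLogger level
log4j.logger.httpclient.wire=WARN
log4j.logger.org.apache.http=WARN
log4j.logger.org.apache.http.wire=WARN
```

A rootLogger left at DEBUG (as turned out to be the case here) enables these categories implicitly, which is why the wire dump appears without any httpclient lines in the file.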
Phil Scadden,
)
at java.lang.Thread.run(Thread.java:619)
--
Phil Scadden, Senior Scientist GNS Science Ltd 764 Cumberland St,
Private Bag 1930, Dunedin, New Zealand Ph +64 3 4799663, fax +64 3 477 5232
Look up highlighting. http://wiki.apache.org/solr/HighlightingParameters
I am a new user and I have Solr installed. I can use the admin page and query
the example data.
However, I was using Nutch to load the index with intranet web pages and I got
this message.
SolrIndexer: starting at 2011-08-12 16:52:44
org.apache.solr.client.solrj.SolrServerException: