Re: Is there a tool to directly index hdfs files to solr?

2018-10-21 Thread Jason Gerlowski
Not familiar with the contrib you mentioned, or the rationale behind its removal. But as to your first question, you might be interested in looking at: https://github.com/lucidworks/hadoop-solr Disclaimer: I help maintain the "hadoop-solr" project mentioned. On Thu, Oct 18, 2018 at 8:17 AM

AW: AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV
On 10/21/2018 01:06 PM, Shawn Heisey wrote: > You do it with the request, not with the client For the UpdateRequests it is the "commitWithinMs"-parameter? To me this parameter sounds like telling the solr-server I need to see this data within "x ms". As we have autoCommit and autoSoftCommit ...

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi Alex, Thanks again for your reply, much appreciated. Martin Frank Hansen, Senior Data Analytiker Data, IM & Analytics Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk Web www.kmd.dk Mobile +4525571418 -Original message- From: Alexandre Rafalovitch Sent: 21 October

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS
Hi Alexandre, Thank you. How does this explain why the issue exists only with SolrCloud and not with standalone? Moshe From: Alexandre Rafalovitch Sent: Sunday, October 21, 2018 5:18:24 PM To: solr-user Subject: Re: Error while indexing Thai core with SolrCloud I would

Re: Tesseract language

2018-10-21 Thread Alexandre Rafalovitch
There are a couple of things mixed in here: 1) The extract handler is not recommended for production usage. It is great for a quick test, just as you did, but for production, running it externally is better. Tika, especially with large files, can use up a lot of memory and trip up the Solr

Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS
Hi, We have a specific exception that happens only on the Thai core and only when we're using SolrCloud. The same indexing activity runs successfully on the EN core with SolrCloud, or on the Thai core with a standalone configuration. We're running on Linux with Solr 4.6 and with

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch
I would check if the Byte-order mark is the cause: https://en.wikipedia.org/wiki/Byte_order_mark The error message does not seem to be a perfect match to this issue, but a good thing to check anyway. That symbol (right at the file start) is usually invisible and can trip Java XML parsers for
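The byte-order-mark check suggested above can be sketched in plain Java. This is a minimal illustration, not something from the thread: the class name `BomCheck` and its methods are hypothetical, and it assumes the input files are UTF-8, so it looks only for the UTF-8 BOM (bytes EF BB BF), not the UTF-16 or UTF-32 variants.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: detects and strips a UTF-8 byte-order mark.
final class BomCheck {
    // The UTF-8 encoding of U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data starts with the three UTF-8 BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns a copy without the BOM, or the same array if no BOM is present.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    // Reads only the first three bytes of the file (requires Java 9+ for readNBytes).
    static boolean fileHasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int n = in.readNBytes(head, 0, 3);
            return n == 3 && hasUtf8Bom(head);
        }
    }

    // Usage: java BomCheck file1.xml file2.xml
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            boolean bom = fileHasUtf8Bom(Paths.get(name));
            System.out.println(name + ": " + (bom ? "UTF-8 BOM present" : "no UTF-8 BOM"));
        }
    }
}
```

Running it over the files being indexed would show whether any of them starts with a BOM; whether that is actually the cause of the error in this thread remains an open question.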

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi Alexandre, Thanks for your reply. Yes, right now it is just for testing the possibilities of Solr and Tesseract. I will take a look at the Tika documentation to see if I can make it work. You said that DIH is not recommended for production usage; what are the recommended method(s) to upload

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS
Thank you. Will check all options and let you know. From: Alexandre Rafalovitch Sent: Sunday, October 21, 2018 8:09:34 PM To: solr-user Subject: Re: Error while indexing Thai core with SolrCloud Ok, That may have been a bit too much :-) However, it was useful.

Re: AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Shawn Heisey
On 10/21/2018 11:43 AM, Clemens Wyss DEV wrote: If I omit the core in the url upon creation of the SolrClient, where can I then "indicate" the core? You do it with the request, not with the client.

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi again, Is there anyone who has some experience of using Tesseract’s OCR module within Solr? The files I am trying to read into Solr are Danish TIFF documents. Martin Frank Hansen, Senior Data Analytiker Data, IM & Analytics Lautrupparken 40-42, DK-2750

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch
Ok, If the same file and the same core definition work on a standalone, then the issue may be different. Can you please share the full stack trace of the message? It may be important to see which thread died. Also, I would just spin up a test Solr 7.5 instance and see if the problem is still

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch
Ok, That may have been a bit too much :-) However, it was useful. There seem to be several possible avenues: 1) You are using SolrJ and your SolrJ version is not the same as the version of the Solr server. There was a bunch of things that could trigger, especially in combination with Unicode

AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV
Thx Shawn! > If they're sleeping, then it's unlikely that there's any real contribution to > system load. I know, but > seeing threads you didn't expect to see? exactly this > You should really be keeping one SolrClient per server node, >and indicating which core to access with each request

Re: Tesseract language

2018-10-21 Thread Alexandre Rafalovitch
Usually, we just say to do a custom solution using the SolrJ client to connect. This gives you maximum flexibility and allows you to integrate Tika either inside your code or as a server. The latest Tika actually has some off-thread handling, I believe, to make it safer to embed. For DIH alternatives, if you

Re: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Shawn Heisey
On 10/21/2018 10:13 AM, Clemens Wyss DEV wrote: Just upgrading from 6.6 to 7.5 and am now seeing many "Connection evictor"-threads which are all Thread.sleep()ing ... What's the stacktrace on those threads? If they're sleeping, then it's unlikely that there's any real contribution to system

6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV
Just upgrading from 6.6 to 7.5 and am now seeing many "Connection evictor"-threads which are all Thread.sleep()ing ... As of 6.6 I am keeping the SolrClients (one per core) in a HashMap. Is this ok or should I create a new SolrClient for each request I am doing? SolrClient creation is as

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS
Hi, Thank you. Full stacktrace below "core_node_name":"172.19.218.201:8082_solr_core_th"}DEBUG - 2018-10-19 02:13:20.343; org.apache.zookeeper.ClientCnxn$SendThread; Reading reply sessionid:0x200b5a04a770005, packet:: clientPath:null serverPath:null finished:false header:: 356,1

Re: Tesseract language

2018-10-21 Thread Erick Erickson
Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS parts out: https://lucidworks.com/2012/02/14/indexing-with-solrj/ On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch wrote: > > Usually, we just say to do a custom solution using SolrJ client to > connect. This

Re: Tesseract language

2018-10-21 Thread Gus Heck
Hi Martin, I wrote a framework (https://github.com/nsoft/jesterj) that is meant to help with small to medium custom solutions. It's not (yet) ready for cases where you need multiple machines feeding data, but so long as a single box can do the work it should be useful. It has a basic Tika stage