Re: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Mikhail Khludnev
I haven't looked at SOLR-8311. But for those who need any plugin class to be SolrCoreAware, you can mark it as "implements QueryResponseWriter"; this allows you to work around the SolrCoreAware restrictions for any class. On Thu, Dec 7, 2017 at 11:56 PM, Markus Jelsma wrote: > cc

RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
Well, I have a lot of OCRed PDFs, but the extremely slow text extraction is hard to pin down. The bulk of the OCRed ones aren't too slow, but then I have one that will take several minutes. I use a little utility, pdftotext.exe, for making a crude guess at whether OCR is necessary and it is much faster
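
The pre-check described above could be sketched roughly as follows. This is only a minimal sketch, not the poster's actual utility: it assumes the pdftotext command-line tool is on the PATH, and the function names and the 100-characters-per-page threshold are hypothetical choices made for illustration.

```python
import subprocess

# Hypothetical threshold: below this many non-whitespace characters per
# page we guess that the PDF is image-only and needs OCR.
MIN_CHARS_PER_PAGE = 100


def extract_text(pdf_path, last_page=3):
    """Run pdftotext on the first few pages and return the extracted text.

    Assumes the pdftotext CLI is installed; "-l" sets the last page to
    convert and "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def probably_needs_ocr(text, pages, min_chars=MIN_CHARS_PER_PAGE):
    """Crude guess: very little extractable text per page suggests OCR is needed."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars * max(pages, 1)
```

A caller might run `probably_needs_ocr(extract_text("some.pdf"), 3)` and route only the PDFs that return True to the OCR step; the threshold would need tuning against real documents.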

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Walter Underwood
No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some PDFs were effectively a substitution code. Our team joked

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Erick Erickson
I'm going to guess it's the exact opposite. The meta-data is the "semi structured" part which is much easier to collect than the PDF. I mean there are parameters to tweak that consider how much space between letters in words (in the body text) should be allowed and still consider it a single word.

Re: Howto search for § character

2017-12-07 Thread Tim Casey
At my last company we ended up writing a custom analyzer to handle punctuation. But this was for Lucene 2 or 3. That analyzer was carried forward as we updated and was used for all human-derived text. Although now there are way better analyzers and way better ways to hook them up, as noted above

Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
I am indexing PDFs, and a separate process has converted any image PDFs to searchable PDFs before Solr gets near them. I notice that Tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing Tika down), just the text. Has anyone used an alternative PDF text

Re: SolrIndexSearcher count

2017-12-07 Thread Shawn Heisey
On 12/5/2017 6:02 AM, Rick Dig wrote: > is it normal to have many instances (100+) of SolrIndexSearchers to be open > at the same time? Our Heap Analysis shows this to be the case. > > We have autoCommit for every 5 minutes, with openSearcher=true, would this > close the old searcher and create a

Re: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Shawn Heisey
On 12/7/2017 6:55 AM, Gilcan Machado wrote: > I have a Solr 4 in production (+ Drupal). > > And I want to migrate Solr to version 7 (at the end). > > But I guess it's safer to migrate from 4 to 5 first. > > Anyway, I'm searching a lot and I couldn't find a documentation that shows > how to

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Thanks a lot for the response. We did not change schema or config. We simply opened 4.5 indexes with 4.10 libraries. Thank you, Rajeswari On 12/7/17, 3:17 PM, "Shawn Heisey" wrote: On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Shawn Heisey
On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1 to 4.10.4 and we see index size reduction. > Trying to see if any optimization done to decrease the index sizes, couldn’t > locate. If anyone knows why please share. Here's a history where you can see the a

Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Hi, We have upgraded Solr from 4.5.1 to 4.10.4 and we see a reduction in index size. We are trying to find out whether any optimization was made to decrease index sizes, but couldn’t locate one. If anyone knows why, please share. Thank you, Rajeswari

RE: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
cc list: Hello Mikhail, Well, disregarding the warning notes in SolrResourceLoader, my meager patch adds TransformerFactory, and the code now runs well. I obviously lack the understanding of this patch with regard to SOLR-8311, but we are fine. So, the patch and the custom code using it are

Re: Howto search for § character

2017-12-07 Thread Shawn Heisey
On 12/7/2017 9:37 AM, Bernd Schmidt wrote: > Indeed, I saw in the analysis tab of the solr admin that the § char will be > removed when using type text_general. > But in this use case we want to make a full text search like "_text_:§45" or > "_text_:§*" to find words starting with §. > We need a

Re: Howto search for § character

2017-12-07 Thread Erick Erickson
You have to use a different analysis chain. There are about a zillion options, here's a _start_: https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html You'll probably be defining one similar to how text_general is defined, and then use your new type in your .

Re: Time-Series data indexing into Solr

2017-12-07 Thread Erick Erickson
You can also use "implicit" (sometimes called "manual") routing. This allows you to create shards on the fly so one pattern is to create, say, a shard per day. Say you have 30 day retention requirements: You can create a new shard every day and delete any shards 31 or more days old. There are
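
The shard-per-day pattern above could be sketched as follows. This is a hedged illustration rather than anything from the thread: the collection name, Solr base URL, and the shard_YYYYMMDD naming scheme are assumptions, and the code only plans the rotation and builds Collections API URLs (CREATESHARD and DELETESHARD) without sending them.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # assumed base URL
COLLECTION = "timeseries"                # hypothetical collection name
RETENTION_DAYS = 30


def shard_name(day):
    """Hypothetical naming scheme; YYYYMMDD names sort chronologically."""
    return "shard_" + day.strftime("%Y%m%d")


def plan_rotation(today, existing_shards, retention_days=RETENTION_DAYS):
    """Return (shard to create or None, sorted list of shards to delete)."""
    todays = shard_name(today)
    # Oldest shard we still keep: exactly retention_days old.
    cutoff = shard_name(today - timedelta(days=retention_days))
    to_create = None if todays in existing_shards else todays
    to_delete = sorted(s for s in existing_shards if s < cutoff)
    return to_create, to_delete


def collections_api_url(action, shard):
    """Build a Collections API URL for the given action and shard."""
    params = {"action": action, "collection": COLLECTION, "shard": shard}
    return SOLR_URL + "/admin/collections?" + urlencode(params)
```

A daily job could call `plan_rotation`, then request the CREATESHARD URL for the new shard and the DELETESHARD URL for each expired one. Documents would also need to be routed to the right shard at index time, which this sketch does not cover.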

Re: Howto search for § character

2017-12-07 Thread Bernd Schmidt
Indeed, I saw in the analysis tab of the solr admin that the § char will be removed when using type text_general. But in this use case we want to make a full text search like "_text_:§45" or "_text_:§*" to find words starting with §. We need a text field here, not a string field! What is your

Re: Howto search for § character

2017-12-07 Thread Erick Erickson
The admin UI/(select core)/analysis page will help you see exactly what happens. Additionally, the "schema browser" bit will show you exactly what's in the index, i.e. the terms as they actually appear after all the analysis chain is completed. Those will definitively tell you what exactly happens

Re: Howto search for § character

2017-12-07 Thread Shawn Heisey
On 12/6/2017 9:09 AM, Bernd Schmidt wrote: > we have defined a field named "_text_" for a full text search based on > field-type "text_general": > stored="false"/>" > > When trying to search for the "§" character, we have strange behaviour: > > q=_text_:§ AND entityClass:StructureNodeImpl  => 

Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew, Oops, I should have mentioned re-indexing. With Solr, you want to be able to re-index quickly so you can try out different analysis chains. XSLT may not be fast enough for this if you have millions of docs. So I would be inclined to save the docs to a normal filesystem, perhaps in

Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew, Do you have some sort of script calling XSLT? Sorry, I do not know Scala and I did not have time to look into your spark utils. The script or Scala could then shell out to curl, or if it is Python it could use the requests library to send a doc to Solr. Extra points for batching the
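
Sending documents to Solr in batches from Python could look roughly like this. It is a sketch under stated assumptions: the core name and URL are hypothetical, the batch size is arbitrary, and it uses the standard-library urllib instead of the requests library mentioned above, purely to stay dependency-free.

```python
import json
import urllib.request

# Assumed update endpoint; "mycore" is a hypothetical core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"
JSON_HEADERS = {"Content-Type": "application/json"}


def batches(docs, size=500):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_request(batch, url=SOLR_UPDATE_URL):
    """Build a POST request carrying one batch as a JSON array of docs."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="POST")


def index_all(docs, size=500):
    """Post every batch, then issue a single commit at the end."""
    for batch in batches(docs, size):
        with urllib.request.urlopen(build_request(batch)) as resp:
            resp.read()
    commit = urllib.request.Request(
        SOLR_UPDATE_URL, data=b'{"commit": {}}', headers=JSON_HEADERS, method="POST"
    )
    with urllib.request.urlopen(commit) as resp:
        resp.read()
```

Committing once at the end, rather than per batch, keeps re-indexing runs faster; the documents passed in are assumed to be dicts already produced by the XSLT step.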

Re: indexing XML stored on HDFS

2017-12-07 Thread Matthew Roth
Yes, the post tool would also be an acceptable option and one I am familiar with. However, I also am not seeing exactly how I would query HDFS. The hadoop-solr [0] tool by Lucidworks looks the most promising. I have a meeting to attend

Re: No Live SolrServer available to handle this request

2017-12-07 Thread Steve Rowe
Hi Selvam, This sounds like it may be a bug - could you please create a JIRA? (See for more info.) Thanks, -- Steve www.lucidworks.com > On Dec 6, 2017, at 9:56 PM, Selvam Raman wrote:

RE: Time-Series data indexing into Solr

2017-12-07 Thread Markus Jelsma
One of our collections is time-series data, processing hundreds of queries per second. But apart from having a time field that is indexed and has docValues enabled, I wouldn't know about any specific recommendations. -Original message- > From:Greenhorn Techie >

Re: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Gilcan Machado
Jesus... Thank you very much!!! []s Gil 2017-12-07 11:58 GMT-02:00 Markus Jelsma : > https://lucene.apache.org/solr/5_0_0/changes/Changes.html > > > > -Original message- > > From:Gilcan Machado > > Sent: Thursday 7th December

RE: Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Markus Jelsma
https://lucene.apache.org/solr/5_0_0/changes/Changes.html -Original message- > From:Gilcan Machado > Sent: Thursday 7th December 2017 14:55 > To: solr-user@lucene.apache.org > Subject: Where can I find documentation to migrate Solr 4 to 5? > > Hi. > > I

RE: TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
Created SOLR-11735 for tracking. https://issues.apache.org/jira/browse/SOLR-11735 -Original message- > From:Markus Jelsma > Sent: Thursday 7th December 2017 14:49 > To: Solr-user > Subject: TransformerFactory does not support

Where can I find documentation to migrate Solr 4 to 5?

2017-12-07 Thread Gilcan Machado
Hi. I have a Solr 4 in production (+ Drupal). And I want to migrate Solr to version 7 (at the end). But I guess it's safer to migrate from 4 to 5 first. Anyway, I'm searching a lot and I couldn't find any documentation that shows how to pick a Solr 4 (in full production) and upgrade to a

Re: Issue while searching with escape characters

2017-12-07 Thread Roopesh Uniyal
Thanks Emir. Got it fixed. The end customer's Solr did not have the records in the first place. They were trying to compare apples with oranges. On Thu, Dec 7, 2017 at 7:43 AM Emir Arnautović wrote: > Hi Roopesh, > If escaping special char with \ does not result in error but in

TransformerFactory does not support SolrCoreAware

2017-12-07 Thread Markus Jelsma
Hi, I'd love to have this supported, but SOLR-8311 states there are issues, and i lack the understanding of the mentioned issues. So, can i add it? Many thanks, Markus

Re: Issue while searching with escape characters

2017-12-07 Thread Emir Arnautović
Hi Roopesh, If escaping a special char with \ does not result in an error but in no results, then it might be worth checking whether your indexing is OK, for example whether it strips parentheses. Can you share an example query and the schema snippet where you define your field and fieldType? Regards, Emir -- Monitoring -

Re: Solr DR Replication

2017-12-07 Thread Greenhorn Techie
Any thoughts / help on this please. Thanks in advance. On Wed, 6 Dec 2017 at 16:21 Greenhorn Techie wrote: > Hi, > > We are on Solr 5.5.2 and wondering what is the best mechanism for > replicating Solr indexes from a Disaster Recovery perspective. As I > understand

Time-Series data indexing into Solr

2017-12-07 Thread Greenhorn Techie
Hi, Is there any recommended approach to index and search time-series data in Solr? Thanks in Advance.

Re: Does apache solr stores the file?

2017-12-07 Thread Charlie Hull
On 06/12/2017 10:10, Gora Mohanty wrote: On 6 December 2017 at 10:39, Munish Kumar Arora wrote: So the questions are, 1. Can I get the PDF content? 2. Does Solr store the actual file somewhere? a. If it does, where? b. If it