RE: Specialized Solr Application

2018-04-20 Thread Allison, Timothy B.
> 1) the toughest pdfs to identify are those that are partly
> searchable (text) and partly not (image-based text).  However, I've
> found that such documents tend to exist in clusters.
Agreed.  We should do something better in Tika to identify image-only pages on 
a page-by-page basis, and then ship those with very little text to tesseract.  
We don't currently do this.

> 3) I have indexed other repositories and noticed some silent
> failures (mostly for large .doc documents).  Wish there was some way
> to log these errors so it would be obvious what documents have been
> excluded.
Agreed on the Solr side.  You can run `java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir>` 
and then run tika-eval on the <output_dir> to count exceptions, even exceptions 
in embedded documents, which are now silently ignored. ☹
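
For concreteness, the whole loop looks roughly like this (jar names and paths are 
placeholders, and the tika-eval "Profile" command and flags are from memory -- check 
the TikaEval wiki for the current syntax):

    java -jar tika-app-1.17.jar -J -t -i /path/to/docs -o /path/to/extracts
    java -jar tika-eval-1.17.jar Profile -extracts /path/to/extracts -db /path/to/profile_db

As I remember it, the second step writes the exception counts and content statistics 
to a small database/report that you can then review.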

> 4) I still don't understand the use of tika.eval - is that an
> application that you run against a collection or what?
Currently, it is set up to run against a directory of extracts (text+metadata 
extracted from pdfs/word/etc).  It will give you info about # of exceptions, 
lang id, and some other statistics that can help you get a sense of how well 
content extraction worked.  It wouldn't take much to add an adapter that would 
have it run against Solr to run the same content statistics.

> 5) I've seen reference to tika-server - but I have no idea on how
> that tool might be usefully applied.
 We have to harden it, but the benefit is that you isolate the tika process in 
its own jvm so that it can't harm Solr.  By harden, I mean we need to spawn a 
child process and set a parent process that will kill and restart on oom or 
permanent hang.  We don't have that yet.  Tika very rarely runs into serious, 
show stopping problems (kill -9 just might solve your problem).  If you only 
have a few 10s of thousands of docs, you aren't likely to run into these 
problems.  If you're processing a few million, esp. noisy things that come off 
the internet, you're more likely to run into these kinds of problems.
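
To be clear about what "harden" would look like, here's a bare-bones sketch of the kind 
of parent/watchdog process I mean -- not something Tika ships today, and the flags, 
heap size and timeouts are only illustrative:

import java.util.concurrent.TimeUnit;

public class TikaServerWatchdog {
    public static void main(String[] args) throws Exception {
        while (true) {
            // run tika-server in a child JVM; ExitOnOutOfMemoryError turns an OOM into a plain process exit
            Process child = new ProcessBuilder(
                    "java", "-XX:+ExitOnOutOfMemoryError", "-Xmx1g", "-jar", "tika-server.jar")
                    .inheritIO()
                    .start();
            while (child.isAlive()) {
                // a real watchdog would also ping the server here and destroyForcibly() on a permanent hang
                child.waitFor(60, TimeUnit.SECONDS);
            }
            System.err.println("tika-server exited with " + child.exitValue() + "; restarting");
        }
    }
}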

> 6) Adobe Acrobat Pro apparently has a batch mode suitable for
> flagging unsearchable (that is, image-based) pdf files and fixing them.
 Great.  If you have commercial tools available, use them.  IMHO, we have a 
ways to go on our OCR integration with PDFs.

> 7) Another problem I've encountered is documents that are themselves
> a composite of other documents (like an email thread).  The problem
> is that a hit on such a document doesn't tell you much about the
> true relevance of each contained document.  You have to do a
> laborious manual search to figure it out.


Agreed.  Concordance search can be useful for making sense of large documents: 
https://github.com/mitre/rhapsode.  The other thing that can be useful for 
handling genuine attachments (pdfs inside of email) is to treat the embedded 
docs as their own standalone/child docs (see the github link and SOLR-7229).
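
For the child-doc route, the SolrJ side is roughly this (field names made up; see the 
github link and SOLR-7229 for the fuller picture, including the query-time side):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ChildDocExample {
    // index an email and one of its attachments as parent + child documents
    static void indexEmailWithAttachment(SolrClient solr) throws Exception {
        SolrInputDocument email = new SolrInputDocument();
        email.addField("id", "email-123");
        email.addField("doc_type", "email");
        email.addField("content", "body of the email ...");

        SolrInputDocument attachment = new SolrInputDocument();
        attachment.addField("id", "email-123-att-1");
        attachment.addField("doc_type", "attachment");
        attachment.addField("content", "text extracted from the embedded pdf ...");

        email.addChildDocument(attachment);   // nested/child document
        solr.add(email);
        solr.commit();
    }
}

A hit on the attachment then points you at the attachment itself, not just the whole 
email thread that happens to contain it.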


> 8) Is there a way to return the size of a matching document (which,
> I think, would help identify non-searchable/image documents)?
Not that I'm aware of, but that's one of the stats calculated by tika-eval.  
Length of extracted string, number of tokens, number of alphabetic tokens, 
number of "common words" (I took top 20k most common words from Wikipedia dumps 
per lang)...and others.
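
Just to illustrate the kind of statistic that last one is -- this is not tika-eval's 
actual code (its tokenization is smarter), only the shape of the calculation:

import java.nio.file.*;
import java.util.*;

public class CommonWordRatio {
    // fraction of extracted tokens that appear in a top-n "common words" list for a language
    public static double commonWordRatio(String extractedText, Set<String> commonWords) {
        String[] tokens = extractedText.toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length == 0) {
            return 0.0;
        }
        long hits = Arrays.stream(tokens).filter(commonWords::contains).count();
        return (double) hits / tokens.length;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = wordlist (e.g. top-20k words), args[1] = an extract to score
        Set<String> common = new HashSet<>(Files.readAllLines(Paths.get(args[0])));
        String text = new String(Files.readAllBytes(Paths.get(args[1])),
                java.nio.charset.StandardCharsets.UTF_8);
        System.out.printf("common-word ratio: %.2f%n", commonWordRatio(text, common));
    }
}

Image-only pdfs and mojibake both show up as very low ratios.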

Cheers,

Tim


RE: Specialized Solr Application

2018-04-18 Thread Allison, Timothy B.
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during 
content extraction.[1]  I had two big concerns when I heard of your task:



1) image only pdfs, which can parse without problem, but which might yield 0 
content.

2) emails (see, e.g. SOLR-12048)



It sounds like you're taking care of 1), and 2) doesn't apply because you're 
using Tika (although note that we've made some major changes to our RFC822 
parsing in the upcoming Tika 1.18).  So, no need to read further! 



In general, surprising things can happen during the content extraction phase, 
and unless you are monitoring/measuring/evaluating what's extracted, your 
search system can yield results that are downright dangerous if you assume that 
the full stack is actually working.



I worked with one batch of documents where HALF of the Excel files weren't 
being parsed.  They all had the same quirk which caused an exception in POI, 
and because they were inside zip files, and Tika's legacy/default behavior is 
to silently ignore embedded exceptions -- the owners of the search system had 
_no idea_ that they'd never be able to find those documents.  At one point, 
Tika wasn't extracting sdt form fields in docx or form fields in pdf...at 
all...imagine if your document set was a bunch of docx with sdts or pdfs with form 
fields...  We just fixed a bug to pull text from joined shapes in ppt...we've 
been missing that text for years!



Those are a few horror stories, I have many, and there are countless more yet 
to be discovered!



The goal of tika-eval[2] is to allow you to see if things don't look right 
based on your expectations.[3]  It doesn't help with indexing at all per se, 
but it can allow you to see odd things and 1) change your processing pipeline 
(add OCR where necessary or use an alternate parser for some file formats) or 
2) raise an issue to fix bugs in the content extraction libraries, or at least 
3) recognize that you aren't getting reliable content out of ~x% of your 
documents.  If manually checking PDFs to determine whether or not to run OCR is 
a hassle, run tika-eval and identify those docs that have a low word count/page 
ratio.



Couple of handfuls of Welsh documents; I thought we only had English...what?!  
No, that's just bad content extraction (character mapping failure in the PDF or 
other mojibake).  Average token length in this document is 1, and it is 
supposed to be English...what?  No, that's the spacing problem that Erick 
mentioned.  Average words per page in some pdfs = 2?  No, that's an image-only 
pdf...that needs to go through OCR.  Ratio of out of vocabulary words = 
90%...no that's character encoding mojibake.





> I was recently indexing a set of about 13,000 documents and at one point, a 
> document caused solr to crash.  I had to restart it.  I removed the offending 
> document, and restarted the indexing.  It then eventually happened again, so I 
> did the same thing.



Crash, crash like OOM?  If you're able to share that with Tika or PDFBox, we 
can _try_ to fix the underlying bug if there is one.  Sometimes, though, our 
parsers require far more memory than is ideal. 



If you have questions about tika-eval, please ask over on the Tika list.  
Apologies for too many words.  Thank you, all, for this discussion!



Cheers,



   Tim





P.S. On metadata author vs. creator, for a good while, we've been trying to 
standardize to Dublin core -- dc:creator.  If you see areas for improvement, 
let us know.



[1] https://www.slideshare.net/TimAllison6/haystack-2018-apachetikaevaltallison

[2] https://wiki.apache.org/tika/TikaEval

[3] Obviously, without ground truth, there is no automated way to detect the 
sdt/form field/grouped text box problems, but tika-eval does what it can to 
identify and count:

a) catastrophic problems (oom, permanent hang)

b) catchable exceptions

c) corrupted text

d) nearly entirely missing text






RE: Specialized Solr Application

2018-04-17 Thread Allison, Timothy B.
+1 to Charlie's guidance.

And...

>60,000 documents, mostly pdfs and emails.
> However, there's a premium on precision (and recall) in searches.

Please, oh, please, no matter what you're using for content/text extraction 
and/or OCR, run tika-eval[1] on the output to ensure that you are getting 
mostly language-y content out of your documents.  Ping us on the Tika user's 
list if you have any questions.

Bad text, bad search. 

[1] https://wiki.apache.org/tika/TikaEval

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Tuesday, April 17, 2018 4:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Specialized Solr Application

On 16/04/2018 19:48, Terry Steichen wrote:
> I have from time-to-time posted questions to this list (and received 
> very prompt and helpful responses).  But it seems that many of you are 
> operating in a very different space from me.  The problems (and
> lessons-learned) which I encounter are often very different from those 
> that are reflected in exchanges with most other participants.

Hi Terry,

Sounds like a fascinating use case. We have some similar clients - small scale 
law firms and publishers - who have taken advantage of Solr.

One thing I would encourage you to do is to blog and/or talk about what you've 
built. Lucene Revolution is worth applying to talk at and if you do manage to 
get accepted - or if you go anyway - you'll meet lots of others with similar 
challenges and come away with a huge amount of useful information and contacts. 
Otherwise there are lots of smaller Meetup events (we run the London, UK one).

Don't assume just because some people here are describing their 350 billion 
document learning-to-rank clustered monster that the small applications don't 
matter - they really do, and the fact that they're possible to build at all is 
a testament to the open source model and how we share information and tips.

Cheers

Charlie
> 
> So I thought it would be useful to describe what I'm about, and see if 
> there are others out there with similar implementations (or interest 
> in moving in that direction).  A sort of pay-forward.
> 
> We (the Lakota Peoples Law Office) are a small public interest, pro 
> bono law firm actively engaged in defending Native American North 
> Dakota Water Protector clients against (ridiculously excessive) criminal 
> charges.
> 
> I have a small Solr (6.6.0) implementation - just one shard.  I'm 
> using the cloud mode mainly to be able to implement access controls.  
> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 
> 8GB of RAM and 4 cpu processors.  We presently have 8 collections with 
> a total of about 60,000 documents, mostly pdfs and emails.  The 
> indexed documents are partly our own files and partly those we obtain 
> through legal discovery (which, surprisingly, is allowed in ND for 
> criminal cases).  We only have a few users (our lawyers and a couple 
> of researchers mostly), so traffic is minimal.  However, there's a 
> premium on precision (and recall) in searches.
> 
> The document repository is local to the server.  I piggyback on the 
> embedded Jetty httpd in order to serve files (selected from the 
> hitlists).  I just use a symbolic link to tie the repository to 
> Solr/Jetty's "webapp" subdirectory.
> 
> We provide remote access via ssh with port forwarding.  It provides 
> very snappy performance, with fully encrypted links.  Appears quite stable.
> 
> I've had some bizarre behavior apparently caused by an interaction 
> between repository permissions, solr permissions and the ssh link.  I 
> seem "solved" for the moment, but time will tell for how long.
> 
> If there are any folks out there who have similar requirements, I'd be 
> more than happy to share the insights I've gained and problems I've 
> encountered and (I think) overcome.  There are so many unique parts of 
> this small scale, specialized application (many dimensions of which 
> are not strictly internal to Solr) that it probably won't be 
> appreciated to dump them on this (excellent) Solr list.  So, if you 
> encounter problems peculiar to this kind of setup, we can perhaps help 
> handle them off-list (although if they have more general Solr 
> application, we should, of course, post them to the list).
> 
> Terry Steichen
> 


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
There's also, of course, tika-server. 

No matter the method, it is always best to isolate Tika to its own JVM, VM, or machine.
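
For what it's worth, calling tika-server from the client side is just an HTTP PUT of the 
raw bytes; a rough sketch (assumes a server already running on the default port 9998, 
and Java 11's HttpClient purely for brevity):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaServerClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // PUT the document bytes to tika-server's /tika endpoint and ask for plain text back
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // extracted text, produced outside the Solr JVM
    }
}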

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 9, 2018 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service 
https://github.com/mattflax/dropwizard-tika-server written by a colleague of 
mine at Flax. Hope this is useful.

Cheers

Charlie




RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 
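
For anyone landing here from a search, the gist of that approach boiled way down (the 
Solr URL, core name and field names are made up; error handling and batching omitted):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            Path file = Paths.get(args[0]);
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1);   // -1 == no write limit
            Metadata metadata = new Metadata();
            try (InputStream is = Files.newInputStream(file)) {
                // Tika runs here, in the client JVM -- a parser crash can't take Solr down with it
                parser.parse(is, handler, metadata);
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.toString());
            doc.addField("content", handler.toString());
            doc.addField("author", metadata.get(TikaCoreProperties.CREATOR));
            solr.add(doc);
            solr.commit();
        }
    }
}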


-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 9, 2018 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder 
wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents 
> we have in our Sharepoint system. I have used the tika-app.jar 
> directly to extract the document in question and it does _not_ throw 
> an exception and extract the contents just fine. So it would seem Solr 
> is doing something different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the 
> HTML document to Tika. As Tika limits nested elements to 100, this 
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of 
> XML element nesting. This is mentioned in TIKA-2091 
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The 
> "solution" is to use Tika's default parsing/mapping mechanism but no 
> details have been provided on how to configure this at Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr 
> to effectively by-pass its built in MostlyPassthroughHtmlMapper and 
> use Tika's implementation.
>
> Thank you!
> Harinder
>
>


RE: Query redg : diacritics in keyword search

2018-03-30 Thread Allison, Timothy B.
For a simple illustration of Charlie's point and a side bonus on the 78 reasons 
to use the ICUFoldingFilter if you happen to be processing Arabic script 
languages, see slides 31-33:

https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf
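
If you want to see what a folding chain actually emits without spinning up Solr, a quick 
Lucene-level sketch (the SPI names and the exact chain here are illustrative -- mirror 
whatever your schema declares):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FoldingDemo {
    public static void main(String[] args) throws Exception {
        // roughly standard tokenizer + lowercase + ASCII folding, preserving the original token
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("asciifolding", "preserveOriginal", "true")
                .build();
        for (String input : new String[]{"Carré", "Carre"}) {
            try (TokenStream ts = analyzer.tokenStream("f", input)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                StringBuilder sb = new StringBuilder(input + " -> ");
                while (ts.incrementToken()) {
                    sb.append('[').append(term).append(']');
                }
                ts.end();
                // both inputs should yield [carre] (plus the preserved [carré] for the accented input)
                System.out.println(sb);
            }
        }
    }
}

The same chain has to run at both index and query time, which is the point Charlie makes below.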
 

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, March 29, 2018 9:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Query redg : diacritics in keyword search

On 29/03/2018 14:12, Peter Lancaster wrote:
> Hi,
> 
> You don't say whether the AsciiFolding filter is at index time or query time. 
> In any case you can easily look at what's happening using the admin analysis 
> tool which helpfully will even highlight where the analysed query and index 
> token match.
> 
> That said I'd expect what you want to work if you simply use <filter class="solr.ASCIIFoldingFilterFactory"/> on both index and query.

Simply put:

You use the filter at indexing time to collapse any variants of a term into a 
single variant, which is then stored in your index.

You use the filter at query time to collapse any variants of a term that users 
type into a single variant, and if this exists in your index you get a match.

If you don't use the same filter at both ends you won't get a match.

Cheers

Charlie

> 
> Cheers,
> Peter.
> 
> -Original Message-
> From: Paul, Lulu [mailto:lulu.p...@bl.uk]
> Sent: 29 March 2018 12:03
> To: solr-user@lucene.apache.org
> Subject: Query redg : diacritics in keyword search
> 
> Hi,
> 
> The keyword search Carré  returns values Carré and Carre (this works 
> well as I added the tokenizer <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/> in 
> the schema config to enable returning of both sets of values)
> 
> Now looks like we want Carre to return both Carré and Carre (and this dosen’t 
> work. Solr only returns Carre) – any ideas on how this scenario can be 
> achieved?
> 
> Thanks & Best Regards,
> Lulu Paul
> 
> 
> 


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
Nice.  Thank you!

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: Thursday, February 15, 2018 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr search word NOT followed by another word

Hi,
I did not provide the right query. If you query as {!complexphrase 
df=name}”Leonardo -da -Vinci” all works as expected. This matches all three 
doc. 

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/


RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I just updated the SpanQueryParser (LUCENE-5205) and its Solr plugin 
(SOLR-5410) for master and 7.2.1.

What version of Solr are you using and which version of the plugin?

These should be available on maven central shortly: version 7.2-0.1

<groupId>org.tallison.solr</groupId>
<artifactId>solr-5410</artifactId>
<version>7.2-0.1</version>


Or you can fork: https://github.com/tballison/lucene-addons/tree/7.2-0.1


-Original Message-
From: ivan [mailto:i...@presstoday.com] 
Sent: Wednesday, February 14, 2018 6:42 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr search word NOT followed by another word

Hi Timothy,

i'm trying to use your Parser, but i'm having some trouble with the versions of 
solr\lucene.
I'm trying to use version 6.4.1 but i'm facing a lot of incompatibilities with 
version 5. Is there any updated version of the plugin?




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I've been away from the ComplexQueryParser for a while, and I was wrong when I 
said in my earlier email that no currently included Solr parse generates a 
SpanNotQuery.  

You're right, Emir, that the ComplexQueryParser does generate a SpanNotQuery, 
and, y, I just tried this with 7.2.1, and it retrieves "Leonardo is the name of 
Leonardo da Vinci".

However, it fails to retrieve:
a) "Leonardo da is the name of Leonardo da Vinci"
and
b) "Leonardo Vinci is the name of Leonardo da Vinci"

because the SpanNot exclude is a SpanOr ("da" or "vinci") after the rewrite: 

spanNot(name:leonardo, spanNear([name:leonardo, spanOr([name:da, name:vinci])], 
0, true), 0, 0)
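
For the record, a hand-built query that treats "da vinci" as an ordered SpanNear in the 
exclude clause handles a) and b) as well; a rough Lucene-level sketch (the field name and 
slop values are just illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class LeonardoNotDaVinci {
    public static void main(String[] args) {
        SpanQuery leonardo = new SpanTermQuery(new Term("name", "leonardo"));
        // "da vinci" as an ordered phrase, not a SpanOr of the individual terms
        SpanQuery daVinci = new SpanNearQuery(new SpanQuery[]{
                new SpanTermQuery(new Term("name", "da")),
                new SpanTermQuery(new Term("name", "vinci"))}, 0, true);
        // the full phrase "leonardo da vinci"
        SpanQuery leonardoDaVinci = new SpanNearQuery(new SpanQuery[]{leonardo, daVinci}, 0, true);
        // keep "leonardo" spans that do not overlap a "leonardo da vinci" span
        SpanQuery query = new SpanNotQuery(leonardo, leonardoDaVinci, 0, 0);
        System.out.println(query);
    }
}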







-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: Tuesday, February 13, 2018 11:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr search word NOT followed by another word

Hi Ivan,
Which version of Solr do you use? I’ve just tried it on 6.5.1 and it returned 
expected.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 13 Feb 2018, at 16:08, ivan  wrote:
> 
> Hi Emir,
> 
> unfortunately that does not work, since i'm not getting a match for my 
> third example ("Leonardo is the name of Leonardo da Vinci") because i 
> have both "Leonardo" and "Leonardo da Vinci" in the same field. I'm 
> fine with having "Leonardo da Vinci" as long as i have another 
> "Leonardo" (NOT followed by da Vinci).
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



RE: Solr search word NOT followed by another word

2018-02-14 Thread Allison, Timothy B.
In process, should finish by end of this week.  I had to put SlowFuzzyQuery 
back in, and I discovered SOLR-11976 while trying to upgrade.  I'll have to do 
a workaround until that is fixed.

-Original Message-
From: simon [mailto:mtnes...@gmail.com] 
Sent: Monday, February 12, 2018 1:21 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Solr search word NOT followed by another word

Tim:

How up to date is the Solr-5410  patch/zip in JIRA ?.  Looking to use the Span 
Query parser in 6.5.1, migrating to 7.x sometime soon.

Would love to see these committed !

-Simon

On Mon, Feb 12, 2018 at 10:41 AM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> That requires a SpanNotQuery.  AFAIK, there is no way to do this with 
> the current parsers included in Solr.
>
> My SpanQueryParser does cover this, and I'm hoping to port it to 7.x 
> today or tomorrow.
>
> Syntax would be "Leonardo [da vinci]"!~0,1
>
> https://issues.apache.org/jira/browse/LUCENE-5205
>
> https://github.com/tballison/lucene-addons/tree/master/lucene-5205
>
> https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205
>
> With Solr wrapper: https://issues.apache.org/jira/browse/SOLR-5410
>
>
> -Original Message-
> From: ivan [mailto:i...@presstoday.com]
> Sent: Monday, February 12, 2018 6:00 AM
> To: solr-user@lucene.apache.org
> Subject: Solr search word NOT followed by another word
>
> What i'm trying to do is to only get results for "Leonardo" when is 
> not followed by "da vinci".
> So any result containing "Leonardo" (not followed by "da vinci") is 
> fine even if i have "Leonardo da vinci" in the result. I want to 
> filter out only the results where i don't have "Leonardo" without "da vinci".
>
> Examples:
> "Leonardo abc abc abc"   OK
> "Leonardo da vinci abab"  KO
> "Leonardo is the name of Leonardo da Vinci"  OK
>
>
> I can't seem to find any way to do that using solr queries. I can't 
> use regex (i have a tokenized text field) and any combination of 
> boolean logic doesn't seem to work.
>
> Any help?
> Thanks
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


RE: Solr search word NOT followed by another word

2018-02-12 Thread Allison, Timothy B.
That requires a SpanNotQuery.  AFAIK, there is no way to do this with the 
current parsers included in Solr.

My SpanQueryParser does cover this, and I'm hoping to port it to 7.x today or 
tomorrow.

Syntax would be "Leonardo [da vinci]"!~0,1

https://issues.apache.org/jira/browse/LUCENE-5205

https://github.com/tballison/lucene-addons/tree/master/lucene-5205

https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205

With Solr wrapper: https://issues.apache.org/jira/browse/SOLR-5410


-Original Message-
From: ivan [mailto:i...@presstoday.com] 
Sent: Monday, February 12, 2018 6:00 AM
To: solr-user@lucene.apache.org
Subject: Solr search word NOT followed by another word

What i'm trying to do is to only get results for "Leonardo" when is not 
followed by "da vinci".
So any result containing "Leonardo" (not followed by "da vinci") is fine even 
if i have "Leonardo da vinci" in the result. I want to filter out only the 
results where i don't have "Leonardo" without "da vinci".

Examples:
"Leonardo abc abc abc"   OK
"Leonardo da vinci abab"  KO
"Leonardo is the name of Leonardo da Vinci"  OK


I can't seem to find any way to do that using solr queries. I can't use regex 
(i have a tokenized text field) and any combination of boolean logic doesn't 
seem to work.

Any help?
Thanks




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
A slightly more refined answer...  In my experience with the systems I've 
worked with, Porter and other stemmers can be useful as a "fallback field" with 
a really low boost, but you should be really careful if you're only searching 
on one field.

Cannot recommend Doug Turnbull and John Berryman's "Relevant Search" enough on 
how to layer fields...among many other great insights: 
https://www.manning.com/books/relevant-search


 -Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, November 30, 2017 9:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.   

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com]
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result 

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.   

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".
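
The idea behind those equivalence sets is nothing fancier than this -- a toy sketch, not 
the code I mention above: run candidate terms through the stemming chain and group the 
originals by stem (chain and terms here are illustrative):

import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EquivalenceSets {
    public static void main(String[] args) throws Exception {
        // chain B: lowercase + Porter, applied to terms that chain A (no stemming) would index
        Analyzer porter = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("porterStem")
                .build();
        String[] unstemmed = {"public", "publication", "publications", "publicly", "publicize"};
        Map<String, List<String>> stemToOriginals = new TreeMap<>();
        for (String term : unstemmed) {
            try (TokenStream ts = porter.tokenStream("f", term)) {
                CharTermAttribute att = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    stemToOriginals.computeIfAbsent(att.toString(), k -> new ArrayList<>()).add(term);
                }
                ts.end();
            }
        }
        // e.g. public -> [public, publication, publications, publicly, publicize]
        stemToOriginals.forEach((stem, originals) -> System.out.println(stem + "\t" + originals));
    }
}

Swap in the unique terms from your own index and cross-check the big buckets against your query logs.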


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com] 
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result term 
> "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!  

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user expectations.
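
You can reproduce the mismatch in a few lines with Lucene's MemoryIndex (the analysis 
chain here is a stand-in: standard tokenizer + lowercase + Porter):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

public class WildcardVsStemming {
    public static void main(String[] args) throws Exception {
        Analyzer porter = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("porterStem")
                .build();
        MemoryIndex index = new MemoryIndex();
        index.addField("f", "Shipping Weight", porter);   // indexed as "ship", "weight"

        // the prefix/wildcard term is NOT analyzed, so it looks for indexed terms starting with "shipp"
        System.out.println("shipp* -> " + index.search(new PrefixQuery(new Term("f", "shipp"))));
        // "ship*" and the analyzed term "ship" both match the stemmed token
        System.out.println("ship*  -> " + index.search(new PrefixQuery(new Term("f", "ship"))));
        System.out.println("ship   -> " + index.search(new TermQuery(new Term("f", "ship"))));
    }
}

A score of 0.0 on the first line is exactly the "shipp*" behavior described above.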



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com] 
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky 
wrote:

> I wish to understand if I can do something to get in result term "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd 
> together. It also takes you down a different analysis pathway, as many 
> analysis components do not work with multiple terms. Look into the 
> SolrAdmin console, and use the analysis tab to understand what is 
> going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
>  wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many 
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”, 
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


RE: Complexphrase treats wildcards differently than other query parsers

2017-10-09 Thread Allison, Timothy B.
  Right.  Sorry.

Despite appearances to the contrary, I'm not a bot designed to lead you down 
the garden path of debugging for yourself with the goal of increasing the size 
of the Solr contributor pool...

I confirmed the failure in 6.x, but all seems to work in 7.x and trunk.  I 
opened SOLR-11450 and attached a unit test based on your correction of mine. 

Thank you, again!


-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Monday, October 9, 2017 8:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

Thanks again, Tim,
following your recipe, I was able to write a failing test:

assertQ(req("q", "{!complexphrase} iso-latin1:cr\u00E6zy*")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

Notice how cr\u00E6zy* is used as a query term which mimics the behaviour I 
originally reported, namely that CPQP does not analyse it because of the 
wildcard and thus does not hit the charfilter from the query side.


2017-10-06 20:54 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>:

> That could be it.  I'm not able to reproduce this with trunk.  More 
> next week.
>
> In trunk, if I add this to schema15.xml:
>   
> 
>    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>   
> 
>   
>stored="true"/>
>
> This test passes.
>
>   @Test
>   public void testCharFilter() {
> assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
> assertU(commit());
> assertU(optimize());
>
> assertQ(req("q", "{!complexphrase} iso-latin1:craezy")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:traen")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:caezy~1")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:crae*")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:*aezy")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:crae*y")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"craezy traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"caezy~1 traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"craez* traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"*aezy traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>
> assertQ(req("q", "{!complexphrase} iso-latin1:\"crae*y traen\"")
> , "//result[@numFound='1']"
> , "//doc[./str[@name='id']='1']"
> );
>   }
>
>
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Friday, October 6, 2017 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other 
> query parsers
>
> Thanks a lot for your effort, Tim.
>
> Looking at it from the Solr side, I see some use of local classes. The 
> snippet below in particular caught my eye (in 
> solr/core/src/java/org/apache/ solr/search/ComplexPhraseQParserPlugin.java).
> The instance of ComplexPhraseQueryParser is not the clean one from 
> Lucene, but a modified one. If any of the modifications messes with 
> the analysis logic, well then that might answer it.
>
> What do you make of it?
>
> lparser = new ComplexPhraseQueryParser(defaultField, getReq().getSchema().
>

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-06 Thread Allison, Timothy B.
That could be it.  I'm not able to reproduce this with trunk.  More next week.

In trunk, if I add this to schema15.xml:
  

  
  

  
  

This test passes.

  @Test
  public void testCharFilter() {
assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
assertU(commit());
assertU(optimize());

assertQ(req("q", "{!complexphrase} iso-latin1:craezy")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:traen")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:caezy~1")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:crae*")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:*aezy")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:crae*y")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:\"craezy traen\"")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:\"caezy~1 traen\"")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:\"craez* traen\"")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:\"*aezy traen\"")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);

assertQ(req("q", "{!complexphrase} iso-latin1:\"crae*y traen\"")
, "//result[@numFound='1']"
, "//doc[./str[@name='id']='1']"
);
  }



-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Friday, October 6, 2017 6:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

Thanks a lot for your effort, Tim.

Looking at it from the Solr side, I see some use of local classes. The snippet 
below in particular caught my eye (in 
solr/core/src/java/org/apache/solr/search/ComplexPhraseQParserPlugin.java).
The instance of ComplexPhraseQueryParser is not the clean one from Lucene, but 
a modified one. If any of the modifications messes with the analysis logic, 
well then that might answer it.

What do you make of it?

lparser = new ComplexPhraseQueryParser(defaultField, getReq().getSchema().getQueryAnalyzer()) {

  protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
    try {
      org.apache.lucene.search.Query wildcardQuery =
          reverseAwareParser.getWildcardQuery(t.field(), t.text());
      setRewriteMethod(wildcardQuery);
      return wildcardQuery;
    } catch (SyntaxError e) {
      throw new RuntimeException(e);
    }
  }

  private Query setRewriteMethod(org.apache.lucene.search.Query query) {
    if (query instanceof MultiTermQuery) {
      ((MultiTermQuery) query).setRewriteMethod(
          org.apache.lucene.search.MultiTermQuery.SCORING_BOOLEAN_REWRITE);
    }
    return query;
  }

  protected Query newRangeQuery(String field, String part1, String part2,
                                boolean startInclusive, boolean endInclusive) {
    boolean reverse = reverseAwareParser.isRangeShouldBeProtectedFromReverse(field, part1);
    return super.newRangeQuery(field,
        reverse ? reverseAwareParser.getLowerBoundForReverse() : part1,
        part2,
        startInclusive || reverse,
        endInclusive);
  }
};

Thanks,
Bjarke




RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
After some more digging, I'm wrong even at the Lucene level.

When I use the CustomAnalyzer and make my UC vowel mock filter MultitermAware, 
I get this with Lucene in trunk:

"the* quick~" name:thE* name:qUIck~2 name:thE name:qUIck

So, there's room for improvement with phrases, but the regular multiterms 
should be ok.

Still no answer for you...

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>:

> There's every chance that I'm missing something at the Solr level, but 
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still 
> not applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the 
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck



RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
Prob the usual reasons...no one has submitted a patch yet, or could be a 
regression after LUCENE-7355.

See also:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201407.mbox/%3c1d06a081892adf4589bd83ee24b9dc3025971...@imcmbx02.mitre.org%3E

I'll take a look.


-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Thursday, October 5, 2017 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

Thanks Tim,
that might be what I'm experiencing. I'm actually quite certain of it :-)

Do you remember any reason that multi term analysis is not happening in 
ComplexPhraseQueryParser?

I'm on 6.6.1, so latest on the 6.x branch.

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. <talli...@mitre.org>:

> There's every chance that I'm missing something at the Solr level, but 
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still 
> not applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the 
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck
>
>
> [1] https://github.com/tballison/lucene-addons/blob/master/
> lucene-5205/src/test/java/org/apache/lucene/queryparser/
> spans/TestAdvancedAnalyzers.java#L117
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Thursday, October 5, 2017 8:02 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Complexphrase treats wildcards differently than other 
> query parsers
>
> What version of Solr are you using?
>
> I thought this had been fixed fairly recently, but I can't quickly 
> find the JIRA.  Let me take a look.
>
> Best,
>
>  Tim
>
> This was one of my initial reasons for my SpanQueryParser 
> LUCENE-5205[1] and [2], which handles analysis of multiterms even in phrases.
>
> [1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
> [2] https://mvnrepository.com/artifact/org.tallison.lucene/
> lucene-5205/6.6-0.1
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Thursday, October 5, 2017 6:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other 
> query parsers
>
> 2017-10-05 11:29 GMT+02:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>
> > Hi Bjarke,
> > You are right - I jumped into wrong/old conclusion as the simplest 
> > answer to your question.
>
>
>  No problem :-)
>
> I guess looking at the code could give you an answer.
> >
>
> This is what I would like to avoid out of fear that my head would 
> explode
> ;-)
>
>
> >
> > Thanks,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> > Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > > <morten...@eluence.com>
> > wrote:
> > >
> > > Well, according to
> > > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> > wildcard-multiterm-queries-in-solr/
> > > multiterm means
> > >
> > > wildcard
> > > range
> > > prefix
> > >
> > > so it is that way i'm using the word. That same article explains 
> > > how analysis will be performed with wildcards if the analyzers are 
> > > multi-term aware.
> > > Furthermore, both lucene and dismax do the correct analysis, so I 
> > > don't think you are right in your statement about the majority of 
> > > QPs skipping analysis for wildcards.
> > >
> > > So I'm still confused as to why complexphrase does things differently.
> > >
> > > Thanks,
> > > /Bjarke
> > >
> > > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> > ><emir.arnauto...@sematext.com
> > >:
> > >
> > >> Hi Bjarke,
> > >> It is not multiterm that is causing query parser to skip analysis 
> > >> chain but wildcard. The majority of query parsers do not analyse 
> > >> query string
> > if
> > >> there are wildcards.
> > >>
> > >> HTH
> > >> Emir
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> > &

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
There's every chance that I'm missing something at the Solr level, but it 
_looks_ at the Lucene level, like ComplexPhraseQueryParser is still not 
applying analysis to multiterms.

When I call this on 7.0.0:
   QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName, analyzer);
return qp.parse(qString);

 where the analyzer is a mock "uppercase vowel" analyzer[1] and the qString is;

"the* quick~" the* quick~ the quick

I get this:
"the* quick~" name:the* name:quick~2 name:thE name:qUIck


[1] 
https://github.com/tballison/lucene-addons/blob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117

-----Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, October 5, 2017 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Complexphrase treats wildcards differently than other query parsers

What version of Solr are you using?

I thought this had been fixed fairly recently, but I can't quickly find the 
JIRA.  Let me take a look.

Best,

 Tim

This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and 
[2], which handles analysis of multiterms even in phrases.

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1 

-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
Sent: Thursday, October 5, 2017 6:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 11:29 GMT+02:00 Emir Arnautović <emir.arnauto...@sematext.com>:

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest 
> answer to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > <morten...@eluence.com>
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how 
> > analysis will be performed with wildcards if the analyzers are 
> > multi-term aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I 
> > don't think you are right in your statement about the majority of 
> > QPs skipping analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> ><emir.arnauto...@sematext.com
> >:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis 
> >> chain but wildcard. The majority of query parsers do not analyse 
> >> query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >>> <morten...@eluence.com>
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning* In my 
> >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to 
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives 
> >>> me "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning
> >>> *
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis 
> >>> chain
> for
> >>> multiterms, even though components and in particular 
> >>> MappingCharFilterFactory are Multi-term aware
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
What version of Solr are you using?

I thought this had been fixed fairly recently, but I can't quickly find the 
JIRA.  Let me take a look.

Best,

 Tim

This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and 
[2], which handles analysis of multiterms even in phrases.

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1 
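
For anyone who wants a schema-level workaround while we track down the fix: Solr lets you declare an explicit "multiterm" analyzer on a field type, and multi-term-aware components such as MappingCharFilterFactory are then applied to wildcard/prefix/range terms.  A rough sketch only -- the field type name, mapping file name and tokenizer choice are illustrative, not taken from Bjarke's schema, and whether the complexphrase parser actually consults this analyzer is exactly the version-dependent question above:

<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With the lucene parser this mapping is already applied implicitly, which is why funktionsnedsättning* is rewritten to content_ol:funktionsnedsattning* in the debug output quoted below.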

-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Thursday, October 5, 2017 6:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 11:29 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> You are right - I jumped to a wrong/old conclusion as the simplest 
> answer to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > 
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how 
> > analysis will be performed with wildcards if the analyzers are 
> > multi-term aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I 
> > don't think you are right in your statement about the majority of 
> > QPs skipping analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> > >:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis 
> >> chain but wildcard. The majority of query parsers do not analyse 
> >> query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >>> 
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning* In my 
> >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to 
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...&debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives 
> >>> me "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...&debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning
> >>> *
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis 
> >>> chain
> for
> >>> multiterms, even though components and in particular 
> >>> MappingCharFilterFactory are Multi-term aware?
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


RE: DataImport Handler Out of Memory

2017-09-27 Thread Allison, Timothy B.
https://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
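
The short version of that FAQ entry: the MySQL JDBC driver loads the entire result set into memory unless you tell DIH to stream it, which you do by setting batchSize="-1" on the data source (DIH then passes a fetch size of Integer.MIN_VALUE to the driver).  A minimal sketch -- connection details, table and field names below are made up for illustration:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="..."
              batchSize="-1"/>
  <document>
    <entity name="item" query="SELECT id, title, body FROM items">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="content"/>
    </entity>
  </document>
</dataConfig>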


-Original Message-
From: Deeksha Sharma [mailto:dsha...@flexera.com] 
Sent: Wednesday, September 27, 2017 1:40 PM
To: solr-user@lucene.apache.org
Subject: DataImport Handler Out of Memory

I am trying to create indexes using the DataImportHandler (Solr 5.2.1). The data is in 
a MySQL DB and there are more than 3.5 million records. My Solr server 
stops due to an OOM (out of memory) error. I tried starting Solr with 12GB of 
RAM but still no luck.


Also, I see that Solr fetches all the documents in one request. Is there a way to 
configure Solr to stream the data from the DB, or is there any other solution someone 
may have tried?


Note: When my records are nearly 2 Million, I am able to create indexes by 
giving Solr 10GB of RAM.


Your help is appreciated.



Thanks

Deeksha




RE: Solr fields for Microsoft files, image files, PDF, text files

2017-09-25 Thread Allison, Timothy B.
bq: How do I get a list of all valid field names based on the file type

bq: You don't. At least I've never found any. Plus various document formats 
will allow custom meta-data fields so there's no definitive list.

It would be trivial to add field counts per mime to tika-eval.  If you're 
interested in this, please open a ticket on Tika's JIRA.


TIKA-2440 Remove Furigana/phonetic as default for xlsx?

2017-08-09 Thread Allison, Timothy B.
Solrians,
  We have a request to drop phonetic strings from xlsx as the default in Tika.  
I'm not familiar enough with Japanese to know if users would generally expect 
to be able to search on these as well as the original.  The current practice is 
to include them.
  Any recommendations?  Thank you!

   Best,

 Tim

-Original Message-
From: Takahiro Ochi (JIRA) [mailto:j...@apache.org] 
Sent: Tuesday, August 8, 2017 2:28 AM
To: d...@tika.apache.org
Subject: [jira] [Created] (TIKA-2440) Phonetic strings handling for 
multilingual environments.

Takahiro Ochi created TIKA-2440:
---

 Summary: Phonetic strings handling for multilingual environments.
 Key: TIKA-2440
 URL: https://issues.apache.org/jira/browse/TIKA-2440
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Takahiro Ochi
Priority: Minor


Hi there,

I would like to propose an idea to improve phonetic strings handling for 
multilingual environments. I believe Tika should not concatenate phonetic 
strings because text with phonetic strings is recognized as noisy text in most 
situations of natural language processing.

Excel files include phonetic strings in some languages such as Japanese, 
Chinese and so on. Apache POI concatenates phonetic strings onto the shared 
strings when Tika extract text from Excel files.

Recent Apache POI has an switch flag for phonetic strings concatination as 
follows:
https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)

Tika should set the 2nd argument "includePhoneticRuns" as false. Here is the 
simple patch for my idea.


{code:java}
$ diff -ru XSSFExcelExtractorDecorator.java 
./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
--- XSSFExcelExtractorDecorator.java2017-06-10 19:13:33.355412625 +0900
+++ 
./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
 2017-06-10 19:14:30.452411830 +0900
@@ -130,7 +130,7 @@
 styles = xssfReader.getStylesTable();

 iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
-strings = new ReadOnlySharedStringsTable(container);
+strings = new ReadOnlySharedStringsTable(container,false);
 } catch (InvalidFormatException e) {
 throw new XmlException(e);
 } catch (OpenXML4JException oe) {

{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


RE: Arabic words search in solr

2017-08-02 Thread Allison, Timothy B.
+1

I was hoping to use this as a case for arguing for turning off an overly 
aggressive stemmer, but I checked on your 10 docs and query, and David is 
right, of course -- if you change the default operator to AND, you only get the 
one document back that you had intended to.

I can still use this as a case for getting on my Unicode normalization soapbox 
and +1'ing your use of the ICUFoldingFilter.  With no token filters, you get 4 
results; when you add the ICUFoldingFilter, you get 8 results; and when you add 
in the Arabic stemmer, you get all 10.  Not that you need this, but see slide 
33 of [1], where we show 78 Unicode variants for "America" in ~800k docs in an 
Arabic script language.  Without Unicode normalization, users might get 1/2 the 
documents back or far, far fewer...and they wouldn't even know what they were 
missing!

[1] 
https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf
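
For concreteness, the kind of chain I mean looks roughly like this (a sketch only -- the field type name is mine, and ICUFoldingFilterFactory ships in the analysis-extras contrib, so those jars need to be on the classpath):

<fieldType name="text_ar_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

And David's point about the default operator can be tested without touching the schema, e.g.:

  q=bizNameAr:(شرطة ازكي)&q.op=AND

which should bring back only the document that contains both words.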

-Original Message-
From: David Hastings [mailto:hastings.recurs...@gmail.com] 
Sent: Wednesday, August 2, 2017 9:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Arabic words search in solr

perhaps change your default operator to AND instead of OR if thats what you are 
expecting for a result

On Wed, Aug 2, 2017 at 8:57 AM, mohanmca01  wrote:

> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> We tried your suggested solution of removing the hyphen while indexing, 
> but it returned wrong results. I was searching for "شرطة ازكي" and 
> it showed the result I was looking for, plus irrelevant 
> results which have either the first or the second word that I typed while 
> searching.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 3,
> "params": {
>   "indent": "true",
>   "q": "bizNameAr:(شرطة ازكي)",
>   "_": "1501678260335",
>   "wt": "json"
> }
>   },
>   "response": {
> "numFound": 444,
> "start": 0,
> "docs": [
>   {
> "id": "28107",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  
> -
> -
> مركز شرطة إزكي",
> "_version_": 1574621132849414100
>   },
>   {
> "id": "13937",
> "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
> "_version_": 157462113219720
>   },
>   {
> "id": "15914",
> "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
> "_version_": 1574621132344000500
>   },
>   {
> "id": "20639",
> "bizNameAr": "سحائب ازكي للتجارة",
> "_version_": 1574621132574687200
>   },
>   {
> "id": "25108",
> "bizNameAr": "المستشفيات -  - مستشفى إزكي",
> "_version_": 1574621132737216500
>   },
>   {
> "id": "27629",
> "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
> "_version_": 1574621132833685500
>   },
>   {
> "id": "36351",
> "bizNameAr": "طوارئ الكهرباء - إزكي",
> "_version_": 157462113318391
>   },
>   {
> "id": "61235",
> "bizNameAr": "اضواء ازكي للتجارة",
> "_version_": 1574621133785792500
>   },
>   {
> "id": "66821",
> "bizNameAr": "أطلال إزكي للتجارة",
> "_version_": 1574621133915816000
>   },
>   {
> "id": "67011",
> "bizNameAr": "بنك ظفار - فرع ازكي",
> "_version_": 1574621133920010200
>   }
> ]
>   }
> }
>
> Actually, we expected only the result below, since it has both of the 
> words that we typed while searching:
>
>   {
> "id": "28107",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  
> -
> -
> مركز شرطة إزكي",
> "_version_": 1574621132849414100
>   },
>
>
> Configuration:
>
> In schema.xml we configured as below:
>
>  stored="true"/>
>
>
>  positionIncrementGap="100">
>   
> 
>  words="lang/stopwords_ar.txt" />
> 
> 
> 
> 
>  pattern="ى"
> replacement="ئ"/>
>  pattern="ء"
> replacement=""/>
>   
> 
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Allison, Timothy B.
>4. Write an external program that fetches the file, fetches the metadata, 
>combines them, and send them to Solr.

I've done this with some custom crawls. Thanks to Erick Erickson, this is a 
snap:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

With the caveat that Tika should really be in a separate vm in production [1].

[1] 
http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
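
A stripped-down sketch of that approach (the core URL, schema field names and the DB lookup are all illustrative placeholders, not from a real setup) -- the point is that Tika runs in the client JVM, not in Solr's:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class IndexOneFile {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // 1) extract text + metadata with Tika, client side
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata tikaMeta = new Metadata();
        try (InputStream is = Files.newInputStream(file)) {
            parser.parse(is, handler, tikaMeta, new ParseContext());
        }

        // 2) combine with whatever you fetched from the database
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getFileName().toString());
        doc.addField("content", handler.toString());
        doc.addField("author_s", tikaMeta.get("dc:creator"));
        // doc.addField("db_status_s", ...);  // placeholder for the DB metadata

        // 3) send to Solr
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}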
 



RE: Solr 6.4. Can't index MS Visio vsdx files

2017-07-03 Thread Allison, Timothy B.
va:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source) 



SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 500 for URL:
http://localhost:80/solr/v20170703xxx/update/extract?resource.name=xx
1 files indexed.
COMMITting Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350



On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the 
> upgrade in Solr to Tika 1.15.  Please chime in on that issue.
>
> You should be able to swap in POI 3.16 (final) wherever you had 
> earlier versions, make sure to include: poi, poi-scratchpad, 
> poi-ooxml, poi-ooxml-schemas.  And make sure to include tika-parsers 
> (1.15), tika-core, tika-java7, tika-xmp.  Also, include 
> commons-collections4 (which is new in POI w Tika 1.14).  (I assume you 
> have already added curvesapi?)
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Saturday, June 3, 2017 5:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Great Tim.
>
> What do I need to do to integrate it on my current installation?
>
>
> On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Apache Tika 1.15 is now available.
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Tuesday, May 9, 2017 7:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Probably better to ask on the Tika list.  We'll push the release asap 
> after PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate 
> for PDFBox this Friday.  Tika will probably have an RC by Monday 5/15, 
> with the release happening later in the week...That's if there are no 
> surprises...[2]
>
> You can get a recent build if you'd like to test [1].
>
> Best,
>
>   Tim
>
> [1] https://builds.apache.org/view/Tika/job/Tika-trunk/
> [2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 
> and 2.0.6-SNAPSHOT on ~500k pdfs, see: http://162.242.228.174/ 
> reports/reports_pdfbox_2_0_6.tar.gz
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, May 9, 2017 7:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Are there any news regarding Tika 1.15? Maybe it's already ready for 
> download somewhere
>
> G.
>
> On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > The release candidate for POI was just cut...unfortunately, I think 
> > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> opening that!
> >
> > That'll be done within a week unless there are surprises.  Once 
> > that's out, I have to update a few things, but I'd think we'd have a 
> > candidate for Tika a week later, then a week for release.
> >
> > You can get nightly builds here: https://builds.apache.org/
> >
> > Please ask on the POI or Tika users lists for how to get the 
> > latest/latest running, and thank you, again, for opening the issue 
> > on
> POI's Bugzilla.
> >
> > Best,
> >
> >Tim
> >
> > -Original Message-
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Wednesday, April 12, 2017 1:00 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > when 1.15 will be released? maybe you have some beta version and I 
> > could test it :)
> >
> > SAX sounds interesting, and from info that I found in google it 
> > could solve my issues.
> >
> > On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> >
> > > It depends.  We've been trying to make parsers more, erm, 
> > > flexible, but there are some problems from which we cannot recover.
> > >
> > > Tl;dr there isn't a short answer.  :(
> > >
> > > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > > people up and running with Solr easily but it is not really a 
> > > great idea for production.  See Erick's gem: 
> > > https://lucidworks.com/2012/ 02/14/indexing-with-solrj/
> > >
> > > As for the Tika por

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
>http -  however, the big advantage of doing your indexing on different machine 
>is that the heavy lifting that tika does in extracting text from documents, 
>finding metadata etc is not happening on the server. If the indexer crashes, 
>it doesn’t affect Solr either.

+1 

for what can go wrong: 
http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
 

https://www.youtube.com/watch?v=vRPTPMwI53k&t=13s&index=43&list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp

Really, we try our best on Tika, but sometimes bad things happen.  Let us know 
when they do, and we'll try to fix them.


RE: How are people using the ICUTokenizer?

2017-06-20 Thread Allison, Timothy B.
> So, if you are trying to make sure your index breaks words properly on 
> eastern languages, just use ICU Tokenizer.   

I defer to the expertise on this list, but last I checked ICUTokenizer uses 
dictionary lookup to tokenize CJK.  This may work well for some tasks, but I 
haven't evaluated whether it performs better than smartcn or even just 
cjkbigramfilter on actual retrieval tasks, and I'd be hesitant to state "just 
use" and imply the problem is solved.  

I thought I remembered ICUTokenizer not playing well with the CJKBigramFilter, 
but it appears to be working in 6.6.

> use the ICUNormalizer
I could not agree with this more.  
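
For reference, the kind of combination I mean looks roughly like this (a sketch; the ICU factories ship in the analysis-extras contrib, and the field type name is mine):

<fieldType name="text_cjk_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Whether that beats smartcn or a plain StandardTokenizer+CJKBigramFilter chain on actual retrieval tasks is the evaluation question I'd still want answered before saying "just use" it.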

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, June 20, 2017 12:02 PM
To: solr-user@lucene.apache.org
Subject: RE: How are people using the ICUTokenizer?

Joel,

I think the issue is doing word-breaking according to ICU rules.   So, if you 
are trying to make sure your index breaks words properly on eastern languages, 
just use ICU Tokenizer.   Unless your text is already in an ICU normal form, 
you should always use the ICUNormalizer character filter along with this:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory

I think that this would be good with Shingles when you are not removing stop 
words, maybe in an alternate analysis of the same content.

I'm using it in this way, with shingles for phrase recognition and only doc 
freq and term freq - my possibly naïve idea is that I do not need positions and 
offsets if I'm using shingles, and my main goal is to do a MoreLikeThis query 
using the shingled versions of fields.

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Tuesday, June 20, 2017 11:52 AM
To: solr-user@lucene.apache.org
Subject: How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I was 
wondering how the community is making use of it.

Does anyone have experience working with the ICUTokenizer that they can share?


Joel Bernstein
http://joelsolr.blogspot.com/


RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
Yeah, Chris knows a thing or two about Tika.  :)

-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com] 
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

No intention of spamming but I also want to mention tika-python 
 in the toolchain.

Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan  wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr) 
> because Python is my main programming language. I have an impression 
> that 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ 
> talk to the server via HTTP or some other more native ways? Is the 
> main benefit of SolrJ over other clients the official shipment with Solr? 
> Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan"  wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just 
>> trying to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
>> 
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a 
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of 
>>> problems there are when parsing all the different formats so I'd 
>>> _really_ follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all 
>>> these different formats, implemented by different vendors with 
>>> different versions that more or less follow a spec which really 
>>> isn't a spec in many cases just recommendations using packages that 
>>> may or may not be actively maintained. And by the way, we'll try to 
>>> handle that 1G document that someone sends us, but don't blame us if 
>>> we hit an OOM.". When Tika is run on the same box as Solr any 
>>> problems in that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU 
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for 
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of
>>> /update/extract/
>>> > from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content 
>>> > to
>>> _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
>>> > 
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too.
>>> It’s
>>> >> got schema and config and even UI for file indexing and searching.
>>>  Check
>>> >> it out README.txt under example/files in your Solr install.
>>> >>
>>> >> Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations! Clarification for 
>>> >> > question
>>> 2: more
>>> >> > specifically I cannot see the field content in the returned 
>>> >> > JSON,
>>> with
>>> >> the
>>> >> > the same definitions as in the post 
>>> >> > >> >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika
>>> >> />
>>> >> > :
>>> >> >
>>> >> > >> stored="true"/>
>>> >> > >> indexed="true"
>>> >> > stored="false"/>
>>> >> > 
>>> >> >
>>> >> > Is it so that Tika does not fill these two fields automatically 
>>> >> > and
>>> I
>>> >> have
>>> >> > to write some client code to fill them?
>>> >> >
>>> >> > Best regards,
>>> >> > Ziyuan
>>> >> >
>>> >> >
>>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <
>>> erickerick...@gmail.com
>>> >> >
>>> >> > wrote:
>>> >> >
>>> >> >> 1> Yes, you can use your single definition. The author 
>>> >> >> 1> identifies
>>> the
>>> >> >> "text" field as a catch-all. Somewhere in the schema there'll 
>>> >> >> be a copyField directive copying (perhaps) many different 
>>> >> >> fields to the "text" field. That permits simple searches 
>>> >> >> against a single field rather than, say, using edismax to 
>>> >> >> search across multiple separate fields.
>>> >> >>
>>> >> >> 2> The link you referenced is for 

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Allison, Timothy B.
> There is no standard across different types of docs as to what meta-data 
> field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing.

On Tika, we _try_ to normalize fields according to various standards, the most 
predominant is Dublin core, so that "author" in one format and "creator" in 
another will both be mapped to "dc:creator".  That said:

1) there are plenty of areas where we could do a better job of normalizing.  
Please let us know how to improve!
2) no matter how well we normalize, there are some metadata items that are 
specific to various file formats...I strongly recommend running Tika against a 
representative batch of documents and deciding which fields you need for your 
application.

Finally, if there's a chance you want metadata from embedded 
documents/attachments, checkout the RecursiveParserWrapper.  Under legacy Tika, 
if you have a bunch of images in a zip file, you'd never get the lat/longs...or 
you'd never get "dc:creator" from an MSWord file sent as an attachment in an 
MSG file.

Finally, and I mean it this time, I heartily second Erik's point about SolrJ 
and the need to keep your file processing outside of Solr's JVM, VM and M!
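
If you do want to experiment with the RecursiveParserWrapper from Java, the 1.15-era usage is roughly the following sketch (the API has been evolving, so check the javadocs for your version):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class RecursiveDemo {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
            new AutoDetectParser(),
            new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        Metadata containerMetadata = new Metadata();
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            wrapper.parse(is, new DefaultHandler(), containerMetadata, new ParseContext());
        }

        // One Metadata object per document: the container file first, then one for
        // each embedded document/attachment, each carrying its own metadata and text.
        List<Metadata> all = wrapper.getMetadata();
        for (Metadata m : all) {
            System.out.println(m.get("Content-Type") + " | " + m.get("dc:creator"));
        }
    }
}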




-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Monday, June 19, 2017 6:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

Ziyuan -

You may be interested in the example/files that ships with Solr too.  It’s got 
schema and config and even UI for file indexing and searching.   Check it out 
README.txt under example/files in your Solr install.

Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
> 
> Hi Erick,
> 
> thanks very much for the explanations! Clarification for question 2: 
> more specifically I cannot see the field content in the returned JSON, 
> with the the same definitions as in the post 
>  ext-inside-documents-indexed-with-solr-plus-tika/>
> :
> 
>  stored="true"/>  indexed="true"
> stored="false"/>
> 
> 
> Is it so that Tika does not fill these two fields automatically and I 
> have to write some client code to fill them?
> 
> Best regards,
> Ziyuan
> 
> 
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson 
> 
> wrote:
> 
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a 
>> copyField directive copying (perhaps) many different fields to the 
>> "text" field. That permits simple searches against a single field 
>> rather than, say, using edismax to search across multiple separate 
>> fields.
>> 
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> There are ways to map meta-data fields from the doc into specific 
>> fields matching your schema. Be a little careful here. There is no 
>> standard across different types of docs as to what meta-data field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing. Here's a 
>> link to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can 
>> easily hack out the DB bits.
>> 
>> BTW, once you get more familiar with processing, I strongly recommend 
>> you do the document processing on the client, the reasons are 
>> outlined in that article.
>> 
>> bq: even I define the fields as he said I cannot see them in the 
>> search results as keys in JSON are the fields set as stored="true"? 
>> They must be to be returned in requests (skipping the docValues 
>> discussion here).
>> 
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot 
>> highlight or view. Fields you highlight must have stored=true BTW.
>> 
>> Whether or not you can highlight "Trevor Hastie" depends an a lot of 
>> things, most particularly whether that text is ever actually in a 
>> field in your index. Just because there's no guarantee that the name 
>> of the file is indexed in a searchable/highlightable way.
>> 
>> And the query q=id:Trevor Hastie won't do what you think. It'll be 
>> parsed as id:Trevor _text_:Hastie _text_ is the default field, look 
>> for a "df" parameter in your request handler in solrconfig.xml 
>> (usually "/select" or "/query").
>> 
>> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan  wrote:
>>> Hi,
>>> 
>>> I am new to Solr and I need to implement a full-text search of some 
>>> PDF files. The indexing part works out of the box by using bin/post. 
>>> I can

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-06-05 Thread Allison, Timothy B.
https://issues.apache.org/jira/browse/SOLR-10335 is tracking the upgrade in 
Solr to Tika 1.15.  Please chime in on that issue.

You should be able to swap in POI 3.16 (final) wherever you had earlier 
versions, make sure to include: poi, poi-scratchpad, poi-ooxml, 
poi-ooxml-schemas.  And make sure to include tika-parsers (1.15), tika-core, 
tika-java7, tika-xmp.  Also, include commons-collections4 (which is new in POI 
w Tika 1.14).  (I assume you have already added curvesapi?)

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Saturday, June 3, 2017 5:39 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Great Tim.

What do I need to do to integrate it on my current installation?


On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:

Apache Tika 1.15 is now available.

-Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: http://162.242.228.174/ 
reports/reports_pdfbox_2_0_6.tar.gz

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
opening that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on
POI's Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem: https://lucidworks.com/2012/ 
> > 02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the
> metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >

Re: XLSB files not indexed

2017-05-31 Thread Allison, Timothy B.
Apache Tika version 1.15 now handles XLSB files.  The behavior described below 
is the expected behavior if a file type is identified but there is no parser to 
handle that file type.  

A little late to the game, I admit... :)

Cheers,

   Tim
 
From    Roland Everaert
Subject Re: XLSB files not indexed
Date    Mon, 21 Oct 2013 07:59:20 GMT

Hi Otis,

In our case, there is no exception raised by Tika or Solr; a Lucene
document is created, but the content field contains only a few whitespace characters,
as with ODF files.


Roland.


On Sat, Oct 19, 2013 at 3:54 AM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi Roland,
>
> It looks like:
> Tika - yes
> Solr - no?
>
> Based on http://search-lucene.com/?q=xlsb
>
> ODF != XLSB though, I think...
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Fri, Oct 18, 2013 at 7:36 AM, Roland Everaert 
> wrote:
> > Hi,
> >
> > Can someone tells me if tika is supposed to extract data from xlsb files
> > (the new MS Office format in binary form)?
> >
> > If so then it seems that solr is not able to index them like it is not
> able
> > to index ODF files (a JIRA is already opened for ODF
> > https://issues.apache.org/jira/browse/SOLR-4809)
> >
> > Can someone confirm the problem, or tell me what to do to make solr works
> > with XLSB files.
> >
> >
> > Regards,
> >
> >
> > Roland.
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-31 Thread Allison, Timothy B.
Apache Tika 1.15 is now available.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: 
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
 
-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening 
> that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on POI's 
> Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem: https://lucidworks.com/2012/ 
> > 02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the
> metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a 
> > pure SAX parser for vsdx.  We just did this for docx and pptx 
> > (coming in 1.15) and these are more robust to variation because they 
> > aren't requiring a match with the ooxml schema.  I haven't looked 
> > much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would 
> > require contributions.  However, I agree that POI shouldn't throw a 
> > Runtime exception.  Perhaps open an issue in POI, or maybe we should 
> > catch this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-05-09 Thread Allison, Timothy B.
Probably better to ask on the Tika list.  We'll push the release asap after 
PDFBox 2.0.6 is out.  Andreas plans to cut the release candidate for PDFBox 
this Friday.  Tika will probably have an RC by Monday 5/15, with the release 
happening later in the week...That's if there are no surprises...[2]

You can get a recent build if you'd like to test [1].

Best,

  Tim

[1] https://builds.apache.org/view/Tika/job/Tika-trunk/
[2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and 
2.0.6-SNAPSHOT on ~500k pdfs, see: 
http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
 
-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Tuesday, May 9, 2017 7:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Are there any news regarding Tika 1.15? Maybe it's already ready for download 
somewhere

G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> The release candidate for POI was just cut...unfortunately, I think 
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening 
> that!
>
> That'll be done within a week unless there are surprises.  Once that's 
> out, I have to update a few things, but I'd think we'd have a 
> candidate for Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the 
> latest/latest running, and thank you, again, for opening the issue on POI's 
> Bugzilla.
>
> Best,
>
>Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I 
> could test it :)
>
> SAX sounds interesting, and from info that I found in google it could 
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > It depends.  We've been trying to make parsers more, erm, flexible, 
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer.  :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get 
> > people up and running with Solr easily but it is not really a great 
> > idea for production.  See Erick's gem: https://lucidworks.com/2012/ 
> > 02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> > the ingesting process to crash.  At most, it should fail at the file 
> > level and not cause greater havoc.  In practice, if you're 
> > processing millions of files from the wild, you'll run into bad 
> > behavior and need to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file, 
> > Tika should catch it and keep going with the rest of the file.  If 
> > this doesn't happen let us know!  We are aware that some types of 
> > embedded file stream problems were causing parse failures on the 
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> > let them percolate up through the parent file (they're reported in 
> > the
> metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I 
> > thought we used to catch those in docx and log them.  I haven't been 
> > able to track this down, though.  I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a 
> > pure SAX parser for vsdx.  We just did this for docx and pptx 
> > (coming in 1.15) and these are more robust to variation because they 
> > aren't requiring a match with the ooxml schema.  I haven't looked 
> > much at vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would 
> > require contributions.  However, I agree that POI shouldn't throw a 
> > Runtime exception.  Perhaps open an issue in POI, or maybe we should 
> > catch this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 
> > team _might_ be able to modify the parser to ignore a stream if 
> > there's an exception, but that's often a sign that something needs 
> > to be fixed with the parser.  In short, the solution will come from POI.

RE: keyword-in-content for PDF document

2017-04-13 Thread Allison, Timothy B.
If you don't care about sentence boundaries, but just want a window around 
target terms and you want concordance functionality (sort before, after, etc), 
you might check out LUCENE-5317, which is available as a standalone jar on my 
github site [1] and is available through maven central.

Using a highlighter, too, will get you close.

See a crummy image of LUCENE-5317 [2] or the full presentation [3]

[1] https://github.com/tballison/lucene-addons/tree/6.5-0.1
[2] https://twitter.com/_tallison/status/852492398793981952
[3] 
https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf
 slide 23ff.
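
On the highlighter route Alex mentioned: with the unified highlighter (available in recent 6.x) you can ask for sentence-bounded snippets, so a request along these lines -- the parameter values are only a starting point -- returns the sentences around each hit on "growth":

  http://localhost:8983/solr/mycore/select?q=content:growth
      &hl=true&hl.fl=content
      &hl.method=unified&hl.bs.type=SENTENCE
      &hl.snippets=10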


-Original Message-
From: ankur [mailto:ankur.sancheti.netw...@gmail.com] 
Sent: Thursday, April 13, 2017 12:08 PM
To: solr-user@lucene.apache.org
Subject: Re: keyword-in-content for PDF document

Thanks Alex. Yes, I am using TIKA. So, to some extent it preserves the text
flow.

There is something interesting in your reply, "Or you could try using
highlighter to return only 
the sentence. ".

I didn't understand that bit. How do we use the Highlighter to return the
sentence?

To make sure, I want to return all sentences where the word "Growth"
appears. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/keyword-in-context-for-PDF-document-tp4329754p4329794.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-12 Thread Allison, Timothy B.
The release candidate for POI was just cut...unfortunately, I think after Nick 
Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!

That'll be done within a week unless there are surprises.  Once that's out, I 
have to update a few things, but I'd think we'd have a candidate for Tika a 
week later, then a week for release.

You can get nightly builds here: https://builds.apache.org/

Please ask on the POI or Tika users lists for how to get the latest/latest 
running, and thank you, again, for opening the issue on POI's Bugzilla.

Best,

   Tim

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Wednesday, April 12, 2017 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

when 1.15 will be released? maybe you have some beta version and I could test 
it :)

SAX sounds interesting, and from info that I found in google it could solve my 
issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> It depends.  We've been trying to make parsers more, erm, flexible, 
> but there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get 
> people up and running with Solr easily but it is not really a great 
> idea for production.  See Erick's gem: https://lucidworks.com/2012/ 
> 02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> the ingesting process to crash.  At most, it should fail at the file 
> level and not cause greater havoc.  In practice, if you're processing 
> millions of files from the wild, you'll run into bad behavior and need 
> to defend against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, 
> Tika should catch it and keep going with the rest of the file.  If 
> this doesn't happen let us know!  We are aware that some types of 
> embedded file stream problems were causing parse failures on the 
> entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> let them percolate up through the parent file (they're reported in the 
> metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I 
> thought we used to catch those in docx and log them.  I haven't been 
> able to track this down, though.  I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' 
> name 'PolylineTo' ", this problem might go away if we implemented a 
> pure SAX parser for vsdx.  We just did this for docx and pptx (coming 
> in 1.15) and these are more robust to variation because they aren't 
> requiring a match with the ooxml schema.  I haven't looked much at 
> vsdx, but that _might_ help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require 
> contributions.  However, I agree that POI shouldn't throw a Runtime 
> exception.  Perhaps open an issue in POI, or maybe we should catch 
> this special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 
> team _might_ be able to modify the parser to ignore a stream if 
> there's an exception, but that's often a sign that something needs to 
> be fixed with the parser.  In short, the solution will come from POI.
>
> Best,
>
>  Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Are there any posibilities to ignore parsing errors and continue indexing?
> because now solr/tika stops parsing whole document if it finds any 
> exception
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your 
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our 
> > codebase...I can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI.  Sorry, I 
> > can't help with this...
> >
> > Best,
> >
> >Tim
> >
> > P.S.
> > >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set 
> > of poi-ooxml-schemas-3.15.jar
> >
> >
> >
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
It depends.  We've been trying to make parsers more, erm, flexible, but there 
are some problems from which we cannot recover.

Tl;dr there isn't a short answer.  :(

My sense is that DIH/ExtractingDocumentHandler is intended to get people up and 
running with Solr easily but it is not really a great idea for production.  See 
Erick's gem: https://lucidworks.com/2012/02/14/indexing-with-solrj/ 

As for the Tika portion... at the very least, Tika _shouldn't_ cause the 
ingesting process to crash.  At most, it should fail at the file level and not 
cause greater havoc.  In practice, if you're processing millions of files from 
the wild, you'll run into bad behavior and need to defend against permanent 
hangs, oom, memory leaks.

Also, at the least, if there's an exception with an embedded file, Tika should 
catch it and keep going with the rest of the file.  If this doesn't happen let 
us know!  We are aware that some types of embedded file stream problems were 
causing parse failures on the entire file, and we now catch those in Tika 
1.15-SNAPSHOT and don't let them percolate up through the parent file (they're 
reported in the metadata though).

Specifically for your stack traces:

For your initial problem with the missing class exceptions -- I thought we used 
to catch those in docx and log them.  I haven't been able to track this down, 
though.  I can look more if you have a need.

For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 
'PolylineTo' ", this problem might go away if we implemented a pure SAX parser 
for vsdx.  We just did this for docx and pptx (coming in 1.15) and these are 
more robust to variation because they aren't requiring a match with the ooxml 
schema.  I haven't looked much at vsdx, but that _might_ help.

For "TODO Support v5 Pointers", this isn't supported and would require 
contributions.  However, I agree that POI shouldn't throw a Runtime exception.  
Perhaps open an issue in POI, or maybe we should catch this special example at 
the Tika level?

For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team 
_might_ be able to modify the parser to ignore a stream if there's an 
exception, but that's often a sign that something needs to be fixed with the 
parser.  In short, the solution will come from POI.

Best,

 Tim

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Tuesday, April 11, 2017 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files

Thanks for your responses.
Are there any posibilities to ignore parsing errors and continue indexing?
because now solr/tika stops parsing whole document if it finds any exception

On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:

> You might want to drop a note to the dev or user's list on Apache POI.
>
> I'm not extremely familiar with the vsd(x) portion of our code base.
>
> The first item ("PolylineTo") may be caused by a mismatch btwn your 
> doc and the ooxml spec.
>
> The second item appears to be an unsupported feature.
>
> The third item may be an area for improvement within our codebase...I 
> can't tell just from the stacktrace.
>
> You'll probably get more helpful answers over on POI.  Sorry, I can't 
> help with this...
>
> Best,
>
>Tim
>
> P.S.
> >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
>
> You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set 
> of poi-ooxml-schemas-3.15.jar
>
>
>


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-11 Thread Allison, Timothy B.
You might want to drop a note to the dev or user's list on Apache POI.

I'm not extremely familiar with the vsd(x) portion of our code base.

The first item ("PolylineTo") may be caused by a mismatch btwn your doc and the 
ooxml spec.

The second item appears to be an unsupported feature.

The third item may be an area for improvement within our codebase...I can't 
tell just from the stacktrace.

You'll probably get more helpful answers over on POI.  Sorry, I can't help with 
this...

Best,

   Tim

P.S.
>  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar

You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set of 
poi-ooxml-schemas-3.15.jar




RE: Japanese character is garbled when using TikaEntityProcessor

2017-04-10 Thread Allison, Timothy B.
Please open an issue on Tika's JIRA and share the triggering file if possible.  
If we can touch the file, we may be able to recommend alternate ways to 
configure Tika's encoding detectors.  We just added configurability to the 
encoding detectors and that will be available with Tika 1.15. [1]

We use a fallback set of detectors: html, universalchardet, icu4j.  Whichever 
one has a non-null answer, we go with that.  This is perhaps not the best 
option, but that's what we've been doing for a while. We are in the process of 
reassessing our current methods[2], but that will take some time.

[1] https://issues.apache.org/jira/browse/TIKA-2273
[2] https://issues.apache.org/jira/browse/TIKA-2038
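
For illustration only -- please double-check the element and class names against the 1.15 documentation before relying on this -- the tika-config approach to controlling the fallback order (or dropping a detector) looks roughly like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
  </encodingDetectors>
</properties>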

-Original Message-
From: Noriyuki TAKEI [mailto:nta...@sios.com] 
Sent: Monday, April 10, 2017 1:46 PM
To: solr-user@lucene.apache.org
Subject: Japanese character is garbled when using TikaEntityProcessor

Hi,All

I use TikaEntityProcessor to extract the text content from binary or text files.

But when I try to extract Japanese characters from an HTML file whose character 
encoding is SJIS, the content is garbled. In the case of UTF-8, it works well.

The setting of Data Import Handler is as below.

--- from here ---

(The dataConfig XML was stripped from the archived message.)

--- to here ---
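A typical TikaEntityProcessor dataConfig for this kind of setup looks roughly like the sketch below; the paths, entity names, and target field are placeholders, not the original settings:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor" rootEntity="false"
            baseDir="/data/html" fileName=".*\.html">
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>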

How do I solve this?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Japanese-character-is-garbled-when-using-TikaEntityProcessor-tp4329217.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr performance issue on indexing

2017-04-04 Thread Allison, Timothy B.
>  Also we will try to decouple Tika from Solr.
+1
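
For what it's worth, a bare-bones sketch of that decoupling with Tika and SolrJ in a separate JVM; the Solr URL, directory, and field names below are placeholders:

import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        AutoDetectParser parser = new AutoDetectParser();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/data/docs"))) {
            for (Path p : dir) {
                // -1 disables the default 100k character cap on extracted text
                BodyContentHandler handler = new BodyContentHandler(-1);
                try (InputStream is = Files.newInputStream(p)) {
                    parser.parse(is, handler, new Metadata());
                } catch (Exception e) {
                    // a bad file only costs this document; it cannot take Solr down
                    System.err.println("skipping " + p + ": " + e);
                    continue;
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", p.toAbsolutePath().toString());
                doc.addField("content", handler.toString());
                solr.add(doc);
            }
        }
        solr.commit();
        solr.close();
    }
}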


-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com] 
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing

Hi, thanks for the feedback.

Yes, it is about OOM; it even makes the Solr instance unavailable. As I was 
saying, I can't find more relevant information in the logs.

We are able to increase the JVM heap, so that's the first thing we'll do.

As far as I know, all documents are bounded to that amount (14K); just the 
processing could change. We are running some indexing tests and it seems to 
work without concurrent threads. Also we will try to decouple Tika from Solr.

By the way, will making it available with SolrCloud improve performance? Or 
will there be no perceptible improvement?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Allison, Timothy B.
> Note that the OCRing is a separate task from Solr indexing, and is best done 
> on separate machines.

+1
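
A rough sketch of that split, using tika-app's batch mode for the OCR/extraction pass (tesseract must be installed and on the PATH) and then indexing the resulting text files from the file system; directory and collection names are placeholders:

  java -jar tika-app.jar -t -i /data/scanned_pdfs -o /data/extracted_text
  bin/post -c mycollection /data/extracted_text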

-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com] 
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR

The workflow is
-/ OCR new documents
-/ check quality and tune until you get good output text
-/ keep the output text in the file system
-/ index and re-index to Solr as necessary from the file system

Note that the OCRing is a separate task from Solr indexing, and is best done on 
separate machines. I used all the old 'surplus' servers for OCR.
Cheers -- Rick
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
See also:

http://stackoverflow.com/a/39792337/6281268

This includes jai.

Most importantly: be aware of the licensing implications of using levigo and 
jai.  If they had been Apache 2.0 compatible, we would have included them.

Finally, there's a new option (coming out in Tika 1.15) that renders each PDF 
page as a single image before running OCR on it.  We found a couple of crazy 
PDFs that had 1000s of images where a single image was used to represent one 
line in a table (and I don't mean row, I mean a literal line in a table).

That "new" option is documented on our wiki:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

Finally (I mean it this time), I've updated our wiki to mention the two 
optional dependencies.  Thank you.

Cheers,

  Tim

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B. <talli...@mitre.org> wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel 
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
>Tim
>
> -Original Message-
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
> Hi Walled,
>
> I've never done that with solr, but you would probably need to use 
> some OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr
> <https://github.com/tesseract-ocr>.
>
> If you want to do that inside solr I've found that Tika has some 
> support for that too.
> Take a look Vijay Mhaskar's post on how to do this using TikaOCR
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-
> tikaocr-to-search-text-inside-an-image/
>
> I hope that guides you
>
> Em dom, 26 de mar de 2017 às 16:09, Waleed Raza < 
> waleed.raza.parhi...@gmail.com> escreveu:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > that are inside PDF and MS Office documents.
> > I found many websites but did not get an answer; please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> > waleed.raza.parhi...@gmail.com
> > > wrote:
> >
> > > Hello
> > > I want to ask how we can extract text in Solr from images
> > > that are inside PDF and MS Office documents.
> > > I found many websites but did not get an answer; please guide me.
> > >
> > >
> >
> --
> [image: INESC TEC]
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of 
> Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt
>


RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
Please also see: 

https://wiki.apache.org/tika/TikaOCR

and

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

If you have any other questions about Apache Tika and OCR, please feel free to 
ask on our users list as well: u...@tika.apache.org

Cheers,

   Tim

-Original Message-
From: Arian Pasquali [mailto:arianpasqu...@gmail.com] 
Sent: Sunday, March 26, 2017 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

Hi Walled,

I've never done that with solr, but you would probably need to use some OCR 
preprocessing before indexing.
The most popular library I know for the job is tesseract-ocr
<https://github.com/tesseract-ocr>.

If you want to do that inside solr I've found that Tika has some support for 
that too.
Take a look Vijay Mhaskar's post on how to do this using TikaOCR

http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/

I hope that guides you

Em dom, 26 de mar de 2017 às 16:09, Waleed Raza < 
waleed.raza.parhi...@gmail.com> escreveu:

> Hello
> I want to ask how we can extract text in Solr from images
> that are inside PDF and MS Office documents.
> I found many websites but did not get an answer; please guide me.
>
> On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> waleed.raza.parhi...@gmail.com
> > wrote:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > that are inside PDF and MS Office documents.
> > I found many websites but did not get an answer; please guide me.
> >
> >
>
--
[image: INESC TEC]

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of 
Artificial Intelligence and Decision Support

*INESC TEC*
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasqu...@inesctec.pt
www.inesctec.pt


Testing an ingest framework that uses Apache Tika

2017-02-16 Thread Allison, Timothy B.
All,

I finally got around to documenting Apache Tika's MockParser[1].  As of Tika 
1.15 (unreleased), add tika-core-tests.jar to your class path, and you can 
simulate:

1. Regular catchable exceptions
2. OOMs
3. Permanent hangs

This will allow you to determine if your ingest framework is robust against 
these issues.

As always, we fix Tika when we can, but if history is any indicator, you'll 
want to make sure your ingest code can handle these issues if you are handling 
millions/billions of files from the wild.
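
If it's useful, a bare-bones sketch of the kind of guard this is meant to exercise: each parse runs in its own thread with a timeout, so a permanent hang shows up as a skipped document rather than a stalled pipeline (OOMs really want a separate child process, since the JVM may not recover). The file name and timeout are placeholders:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class GuardedParse {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Path file = Paths.get(args[0]);
        Future<String> result = pool.submit(() -> {
            try (InputStream is = Files.newInputStream(file)) {
                BodyContentHandler handler = new BodyContentHandler(-1);
                parser.parse(is, handler, new Metadata());
                return handler.toString();
            }
        });
        try {
            // give up on this document after two minutes
            System.out.println(result.get(2, TimeUnit.MINUTES).length() + " chars extracted");
        } catch (Exception e) {
            // timeout, parse exception, etc.: log it and move on to the next document
            result.cancel(true);
            System.err.println("skipping " + file + ": " + e);
        } finally {
            pool.shutdownNow();
        }
    }
}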

Cheers,

Tim


[1] https://wiki.apache.org/tika/MockParser


RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

2017-02-08 Thread Allison, Timothy B.
>It is *strongly* recommended to *not* use the Tika that's embedded within
>Solr, but instead to do the processing outside of Solr in a program of your
>own and index the results.

+1 

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C7F30%40BY2PR09MB112.namprd09.prod.outlook.com%3E
 


RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Shouldn't have taken you that much effort.  Sorry.

Y, I should probably get around to a patch for: 
https://issues.apache.org/jira/browse/SOLR-9552

Although, frankly, it might be time for Tika 1.15 shortly.

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Monday, February 6, 2017 11:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

Tim, you saved my day ;)

now vsdx files were indexed successfully.

Thank you very much!!!

summary: as a workaround I have in solr-6.4.0\contrib\extraction\lib:

1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
2. curvesapi-1.03.jar


So now I'm waiting for this to be implemented in an official version of
solr/tika.

Regards,
Gytis

On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.
>
> For now, add this jar:
> https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
>
> See also [1]
>
> [1] http://apache-poi.1045710.n5.nabble.com/support-for-
> reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Monday, February 6, 2017 8:19 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> sad, but didn't help.
>
> what I did:
>
> 1. stopped solr: bin\solr stop -p 80
> 2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
> 3. add ooxml-schemas-1.3.jar to contrib\extraction\lib
> 4. restarted solr: bin\solr start -p 80 -m 4g
> 5. tried again to parse vsdx file:
>
> java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx 
> -Drecursive=yes -jar example/exampledocs/post.jar "I:\Tools"
>
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:80/solr/db_new02/update...
> Entering auto mode. File endings considered are vsd,vsdx
> Entering recursive mode, max depth=999, delay=0s
> Indexing directory I:\Tools (1 files, depth=0)
> POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #500 (Server Error) 
> for
> url:
> http://localhost:80/solr/db_new02/update/extract?resource.
> name=I%3A%5CTools%5Cspan+ports.vsdx
> SimplePostTool: WARNING: Response:
> Error 500 Server Error
>
> HTTP ERROR 500
> Problem accessing /solr/db_new02/update/extract. Reason: Server Error
> Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
> at java.lang.Class.getConstructor0(Unknown Source)
> at java.lang.Class.getDeclaredConstructor(Unknown Source)
> at org.apache.poi.xdgf.util.ObjectFactory.put(
> ObjectFactory.java:34)
> at
> org.apache.poi.xdgf.usermodel.section.geometry.
> GeometryRowFactory.clinit(GeometryRowFactory.java:39)
> at
> org.apache.poi.xdgf.usermodel.section.GeometrySection.
> init(GeometrySection.java:55)
> at
> org.apache.poi.xdgf.usermodel.XDGFSheet.init(XDGFSheet.java:77)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:113)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:107)
> at
> org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(
> XDGFBaseContents.java:82)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(
> XDGFMasterContents.java:66)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(
> XDGFMasters.java:101)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(
> XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.init(
> XmlVisioDocument.java:79)
> at
> org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
> at
> org.apache.poi.extractor.ExtractorFactory.createExtractor(
> ExtractorFactory.java:207)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(
> OOXMLExtractorFactory.java:86)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.
> parse(OOXMLParser.java:87)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.ja

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Argh.  Looks like we need to add curvesapi (BSD 3-clause) to Solr.

For now, add this jar:
https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03 

See also [1]

[1] 
http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Monday, February 6, 2017 8:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

sad, but didn't help.

what I did:

1. stopped solr: bin\solr stop -p 80
2. removed poi-ooxml-schemas-3.15.jar from contrib\extraction\lib
3. add ooxml-schemas-1.3.jar to contrib\extraction\lib
4. restarted solr: bin\solr start -p 80 -m 4g
5. tried again to parse vsdx file:

java -Dauto -Dc=db_new02 -Dport=80 -Dfiletypes=vsd,vsdx -Drecursive=yes -jar 
example/exampledocs/post.jar "I:\Tools"

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:80/solr/db_new02/update...
Entering auto mode. File endings considered are vsd,vsdx
Entering recursive mode, max depth=999, delay=0s
Indexing directory I:\Tools (1 files, depth=0)
POSTing file span ports.vsdx (application/octet-stream) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #500 (Server Error) for
url:
http://localhost:80/solr/db_new02/update/extract?resource.name=I%3A%5CTools%5Cspan+ports.vsdx
SimplePostTool: WARNING: Response:   
Error 500 Server Error

HTTP ERROR 500
Problem accessing /solr/db_new02/update/extract. Reason: Server Error
Caused by: java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.getDeclaredConstructor(Unknown Source)
at org.apache.poi.xdgf.util.ObjectFactory.put(ObjectFactory.java:34)
at
org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.clinit(GeometryRowFactory.java:39)
at
org.apache.poi.xdgf.usermodel.section.GeometrySection.init(GeometrySection.java:55)
at
org.apache.poi.xdgf.usermodel.XDGFSheet.init(XDGFSheet.java:77)
at
org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:113)
at
org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:107)
at
org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
at
org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
at
org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
at
org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
at
org.apache.poi.xdgf.usermodel.XmlVisioDocument.init(XmlVisioDocument.java:79)
at
org.apache.poi.xdgf.extractor.XDGFVisioExtractor.init(XDGFVisioExtractor.java:41)
at
org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207)
at
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-06 Thread Allison, Timothy B.
Ah, ConnectsType.  That's fixed in the most recent version of POI [1], and will 
soon be fixed in Tika [2].  So, no need to open a ticket on Tika's Jira.

> as tika is failing, could it help or not?

Y, that will absolutely help.  In your Solr contrib/extract/lib directory, 
you'll see poi-ooxml-schemas-3.xx.jar.  Remove that jar and add 
ooxml-schemas.jar [3].  As documented in [4], poi-ooxml-schemas is a subset of 
the much larger (complete) ooxml-schemas; ConnectsType was not in the subset, 
but it _should_ be in ooxml-schemas.

Cheers,

 Tim



[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=60489
[2] https://issues.apache.org/jira/browse/TIKA-2208 
[3] https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas/1.3 
[4] http://poi.apache.org/faq.html#faq-N10025 


Hi again,

I've tried with tika-app - didn't help

java -jar tika-app-1.14.jar "I:\Dat\span ports.vsdx"
Exception in thread "main" java.lang.NoClassDefFoundError:
com/microsoft/schemas/office/visio/x2012/main/ConnectsType
at com.microsoft.schemas.office.visio.x2012.main.impl.
PageContentsTypeImpl.getConnects(Unknown Source)
at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(
XDGFBaseContents.java:89)
at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(
XDGFPageContents.java:73)
at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(
XDGFPages.java:94)
at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(
XmlVisioDocument.java:108)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
        at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(
XmlVisioDocument.java:79)
        at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(
XDGFVisioExtractor.java:41)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(
ExtractorFactory.java:207)
at org.apache.tika.parser.microsoft.ooxml.
OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.
parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(
CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(
CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(
AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.
visio.x2012.main.ConnectsType
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 17 more


So next step is to open bug ticket on tika's jira.


And what about with your proposed workaround?
"If this is a missing bean issue (sorry, I can't tell from your stacktrace 
which class is missing), as a temporary workaround, you can rm 
"poi-ooxml-schemas" and add the full "ooxml-schemas", and you should be good to 
go. [3]"

as tika is failing, could it help or not?

Gytis


On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a 
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar 
>
> If the problem is fixed, we'll try to upgrade dependencies in Solr.  
> If it isn't fixed, please open a bug on Tika's Jira.
>
> If this is a missing bean issue (sorry, I can't tell from your 
> stacktrace which class is missing), as a temporary workaround, you can 
> rm "poi-ooxml-schemas" and add the full "ooxml-schemas", and you 
> should be good to go. [3]
>
> Cheers,
>
>   Tim
>
> [1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar
>
> [2] https://builds.apache.org/job/Tika-trunk/1193/org.apache.
> tika$tika-app/artifact/org.apache.tika/tika-app/1.15-
> 20170202.203920-124/tika-app-1.15-20170202.203920-124.jar
>
> [3] http://poi.apache.org/faq.html#faq-N10025
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, February 3, 2017 9:49 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> This kind of information extraction comes from Apache Tika that is 
> shipped with Solr. However Solr does not ship every possible parser 
> with its installation. So, I think you are hitting Tika where it 
> manages to figure out what type of content you have, but does not have 
> (Apache POI - another O/S project) library installed.
>
> What you need 

RE: Solr 6.4. Can't index MS Visio vsdx files

2017-02-03 Thread Allison, Timothy B.
This is a Tika/POI problem.  Please download tika-app 1.14 [1] or a nightly 
version of Tika [2] and run 

java -jar tika-app.jar 

If the problem is fixed, we'll try to upgrade dependencies in Solr.  If it 
isn't fixed, please open a bug on Tika's Jira.

If this is a missing bean issue (sorry, I can't tell from your stacktrace which 
class is missing), as a temporary workaround, you can rm "poi-ooxml-schemas" 
and add the full "ooxml-schemas", and you should be good to go. [3]

Cheers,

  Tim

[1] http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.14.jar 

[2] 
https://builds.apache.org/job/Tika-trunk/1193/org.apache.tika$tika-app/artifact/org.apache.tika/tika-app/1.15-20170202.203920-124/tika-app-1.15-20170202.203920-124.jar

[3] http://poi.apache.org/faq.html#faq-N10025

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Friday, February 3, 2017 9:49 AM
To: solr-user 
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

This kind of information extraction comes from Apache Tika that is shipped with 
Solr. However Solr does not ship every possible parser with its installation. 
So, I think you are hitting Tika where it manages to figure out what type of 
content you have, but does not have (Apache POI - another O/S project) library 
installed.

What you need to do is to get the additional jar from Tika/POI's 
project/download and make it visible to Solr (probably as an extension jar in a 
lib folder somewhere - I am a bit hazy on that for latest Solr).

The version of Tika that Solr uses is part of the changes notes. For 6.4, it is 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/CHANGES.txt
and it is Tika 1.13

Hope it helps,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 3 February 2017 at 05:57, Gytis Mikuciunas  wrote:
> Hi,
>
>
> I'm using single core Solr 6.4 instance on windows server (windows 
> server
> 2012 R2 standard),
> Java v8, (build 1.8.0_121-b13).
>
> All works more or less ok, except MS Visio vsdx files indexing.
>
>
> Every time it throws an error (no matters if it tries to index vsdx 
> file or for example docx with visio diagram inside).
>
> Thx in advance for your help. If you need some additional info, please ask.
>
>
> Error/Exception from log:
>
>
>  Null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> Could not initialize class 
> org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory
> at
> org.apache.poi.xdgf.usermodel.section.GeometrySection.init(GeometrySection.java:55)
> at
> org.apache.poi.xdgf.usermodel.XDGFSheet.init(XDGFSheet.java:77)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:113)
> at
> org.apache.poi.xdgf.usermodel.XDGFShape.init(XDGFShape.java:107)
> at
> org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66)
> at
> org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106)
> at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
> at
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.init(XmlVisioDocument.java:79)
> at
> org.apache.poi.xdgf.extractor.XDGFVisioExtractor.init(XDGFVisioExtractor.java:41)
> at
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
> at
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
> at
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
> at
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> at
> 

RE: Zip Bomb Exception in HTML File

2017-01-04 Thread Allison, Timothy B.
This came up back in September [1] and [2].  Same trigger...crazy number of 
divs.  

I think we could modify the AutoDetectParser to enable configuration of maximum 
zip-bomb depth via tika-config.

If there's any interest in this, re-open TIKA-2091, and I'll take a look.

Best,

Tim

[1] http://git.net/ml/solr-user.lucene.apache.org/2016-09/msg00561.html
[2] https://issues.apache.org/jira/browse/TIKA-2091

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, January 4, 2017 12:20 PM
To: solr-user 
Subject: Re: Zip Bomb Exception in HTML File

You might get a more knowledgeable response from the Tika folks, that's really 
not something Solr controls.


Best,
Erick

On Wed, Jan 4, 2017 at 8:50 AM,   wrote:
> i get an exception 

RE: Unicode Character Problem

2016-12-12 Thread Allison, Timothy B.
> I don't see any weird character when I manual copy it to any text editor.

That's a good diagnostic step, but there's a chance that Adobe (or your viewer) 
got it right, and Tika or PDFBox isn't getting it right.

If you run tika-app on the file [0], do you get the same problem?  See our stub 
on common text extraction challenges with PDFs [1] and how to run PDFBox's 
ExtractText against your file [2].

[0] java -jar tika-app.jar -i  -o 
[1] https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
[2] https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems 

-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com] 
Sent: Monday, December 12, 2016 10:55 AM
To: solr-user@lucene.apache.org; Ahmet Arslan 
Subject: Re: Unicode Character Problem

Hi Ahmet,

I don't see any weird character when I manual copy it to any text editor.

On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan 
wrote:

> Hi Furkan,
>
> I am pretty sure this is a pdf extraction thing.
> Turkish characters caused us trouble in the past during extracting 
> text from pdf files.
> You can confirm by performing manual copy-paste from original pdf file.
>
> Ahmet
>
>
> On Friday, December 9, 2016 8:44 PM, Furkan KAMACI 
> 
> wrote:
> Hi,
>
> I'm trying to index Turkish characters. These are what I see at my 
> index (I see both of them at different places of my content):
>
> aç  klama
> açıklama
>
> These are same words but indexed different (same weird character at 
> first one). I see that there is not a weird character when I check the 
> original PDF file.
>
> What do you think about it. Is it related to Solr or Tika?
>
> PS: I use text_general for analyser of content field.
>
> Kind Regards,
> Furkan KAMACI
>


RE: negation search help

2016-11-23 Thread Allison, Timothy B.
You've gotten far better answers on this already, but you can use the 
SpanNotQuery in the SpanQueryParser I maintain and have published to maven 
central [1][2][3].

This does not carry out any nlp, but this would allow literal "headache (no 
not)"!~5,0 -> "headache" but not if "no" or "not" shows up within 5 words 
before. 

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://github.com/tballison/lucene-addons/tree/master/solr-5410 
[3] 
http://search.maven.org/#artifactdetails%7Corg.tallison.lucene%7Clucene-addons%7C6.3-0.1%7Cpom
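
If you'd rather build it programmatically, the rough equivalent with Lucene's span queries (the field name here is just an example) is:

    SpanQuery include = new SpanTermQuery(new Term("text", "headache"));
    SpanQuery exclude = new SpanOrQuery(
        new SpanTermQuery(new Term("text", "no")),
        new SpanTermQuery(new Term("text", "not")));
    // pre=5, post=0: drop matches where the negation occurs in the 5 positions before "headache"
    Query q = new SpanNotQuery(include, exclude, 5, 0);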
 


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Wednesday, November 23, 2016 10:03 AM
To: solr-user 
Subject: Re: negation search help

Well, then 'no' becomes a signal token. So, the question is how many tokens 
after that it affects in its circle of negation?

You could probably use something like
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-SurroundQueryParser
to say (if user said 'headache').
-{!surround} 3w(not, headache)

But I am not sure how this would work in terms of multi-term queries.

Alternatively, you could transform your input with custom token filter that, 
after seeing the term 'no', 'not', will just eat that and next n? tokens.

Or you could run the sentences through natural language recognition and 
remove/mark noun phrases that are negative.

What I am trying to say is that Solr can do a bunch of different things for 
you. But you first need to translate your domain problem into a much lower 
level pseudo-language problem that addresses your needs. Including the 
edge-cases, which none of us can guess from your description. Then you can 
implement it in Solr.

Hope this helps,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 24 November 2016 at 01:43, Hem Naidu
 wrote:
> Correct Alex. The use case is that when a provider searches patient medical
> information for certain symptoms, mentions like "no headache", "no
> blood loss", "not diabetic" should not show up in the search results.
>
> Thanks
>
>
> -Original Message-
> From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
> Sent: Wednesday, November 23, 2016 8:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: negation search help
>
> Now that I read better, do you mean that at indexing time those negations 
> must be recognized, in the way that they are no match ?
>
> Cheers
>
> On Wed, Nov 23, 2016 at 2:20 PM, Alessandro Benedetti < 
> benedetti.ale...@gmail.com> wrote:
>
>> Hi Hem,
>> are you expecting Solr to parse your natural language query out of 
>> the box ?
>> Are you using any custom query parser ?
>>
>> If not, you need to follow the Lucene syntax to define negative queries.
>>
>> And be careful to the edge cases [1] .
>>
>> Cheers
>>
>> [1] https://wiki.apache.org/solr/NegativeQueryProblems
>>
>> On Wed, Nov 23, 2016 at 1:54 PM, Hem Naidu > invalid> wrote:
>>
>>> Alex
>>>
>>> Whenever the keywords or sentence are followed by "no", "not", etc., they
>>> should be excluded from the search results. Does Solr support this feature?
>>>
>>> Thanks
>>>
>>>
>>> Sent from my iPhone
>>>
>>>
>>> > On Nov 23, 2016, at 12:09 AM, Alexandre Rafalovitch 
>>> > 
>>> wrote:
>>> >
>>> > How do you _know_ it is not 'apparent' ? Is it because it is 
>>> > preceded by the keyword 'no'? Just that keyword? At what maximum distance?
>>> >
>>> > Regards,
>>> >   Alex
>>> >
>>> > On 23 Nov 2016 2:59 PM, "Hem Naidu"
>>> > 
>>> > wrote:
>>> >
>>> >> Gurus,
>>> >>
>>> >> I am new to Solr, I have a requirement to index entire pdf/word
>>> documents
>>> >> using Solr Tika. Which was successful and able to get the search
>>> results
>>> >> displayed. Now I need to fine tune the results or adjust index so 
>>> >> the negative statements should be filtered out the results like 
>>> >> my input
>>> text
>>> >> for index from the documents would be
>>> >> ---
>>> >> Fortunately no concurrent trauma was found In no apparent 
>>> >> distress
>>> >> --
>>> >>
>>> >> If user searches for concurrent trauma or distress the search 
>>> >> engine
>>> should
>>> >> filter out the results as it not apparent symptom.
>>> >>
>>> >> Any help on whether Solr can do this?
>>> >> If so, do I need to adjust the index or build custom queries?
>>> >>
>>> >> Any help on this would be greatly appreciated !
>>> >>
>>> >> Thanks
>>> >>
>>> >>
>>> >>
>>>
>>
>>
>>
>> --
>> --
>>
>> Benedetti Alessandro
>> Visiting card - http://about.me/alessandro_benedetti
>> Blog - http://alexbenedetti.blogspot.co.uk
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William 

Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All,

I recently blogged about some of the work we're doing with a large scale 
regression corpus to make Tika, POI and PDFBox more robust and to identify 
regressions before release.  If you'd like to chip in with recommendations, 
requests or Hadoop/Spark clusters (why not shoot for the stars), please do!

  
http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Many thanks, again, to Rackspace for our vm and to Common Crawl and govdocs1 
for most of our files!

Cheers,

 Tim


RE: SOLR Sizing

2016-10-03 Thread Allison, Timothy B.
This doesn't answer your question, but Erick Erickson's blog on this topic is 
invaluable:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

-Original Message-
From: Vasu Y [mailto:vya...@gmail.com] 
Sent: Monday, October 3, 2016 2:09 PM
To: solr-user@lucene.apache.org
Subject: SOLR Sizing

Hi,
 I am trying to estimate disk space requirements for the documents indexed to 
SOLR.
I went through the LucidWorks blog (
https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/)
and using this as the template. I have a question regarding estimating "Avg. 
Document Size (KB)".

When calculating disk storage requirements, can we use the Java types sizing (
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html) to
come up with an average document size?

Please let know if the following assumptions are correct.

 Data Type          Size
 -----------------  ------------------------------------------------------------
 long               8 bytes
 tint               4 bytes
 tdate              8 bytes (stored as long?)
 string             1 byte per char for ASCII chars, 2 bytes per char for non-ASCII (double-byte) chars
 text               1 byte per char for ASCII chars, 2 bytes per char for non-ASCII (double-byte) chars (for both with & without norms?)
 ICUCollationField  2 bytes per char for non-ASCII (double-byte) chars
 boolean            1 bit?

 Thanks,
 Vasu


RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Not sure what to do with this one.

The triggering document has a run of ~50  starts and then ~50+  
starts.  So, y, Tika limits nested elements to 100.

Tika's DefaultHtmlMapper only passes through a few handfuls of elements 
(SAFE_ELEMENTS), not including  or . 

Solr's MostlyPassThroughHtmlMapper passes through, well, mostly everything.


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, September 22, 2016 12:47 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Disabling Zip bomb detection in Tika

So far a Tika JIRA seems like the right thing. Tim is "a well known entity"
in Solr though so I'm sure he'll move it over to Solr if appropriate.

Erick

On Thu, Sep 22, 2016 at 9:43 AM, Rodrigo Rosenfeld Rosas 
<rr_ro...@yahoo.com.br.invalid> wrote:
> Here it is. Not sure if it's clear enough though:
>
> https://issues.apache.org/jira/browse/TIKA-2091
>
> Or should I have created the ticket in the Solr project instead?
>
>
> Em 22-09-2016 13:32, Rodrigo Rosenfeld Rosas escreveu:
>>
>> This is one of the documents:
>>
>>
>> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e6
>> 11133_f6ef-eutelsat.htm
>>
>> I'll try to create a ticket for this on Jira if I find its location 
>> but feel free to open it yourself if you prefer, just let me know.
>>
>> Em 22-09-2016 12:33, Allison, Timothy B. escreveu:
>>>>
>>>> I'll try to get a sample HTML yielding to this problem and attach 
>>>> it to Jira.
>>>
>>> Great!  Tika 1.14 is around the corner...if this is an easy fix ... 
>>> :)
>>>
>>> Thank you.
>>>
>>
>>
>


RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Tika might be overkill for you (no one can hear us, right?).  


One thing that Tika buys you is fairly smart encoding detection for html pages. 
 Looks like Nokogiri does do some kind of encoding detection, but it may only 
read the meta-headers.  I haven't used Nokogiri, but if you're happy with the 
results of that, go for it.
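
If you do stay with Tika outside of Solr, the text extraction itself is only a couple of lines with the Tika facade (encoding detection included); the file name below is just an example:

    import java.io.File;
    import org.apache.tika.Tika;

    Tika tika = new Tika();
    // detects the charset and strips the markup in one call
    String text = tika.parseToString(new File("page.html"));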


-Original Message-
From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID] 
Sent: Thursday, September 22, 2016 12:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Disabling Zip bomb detection in Tika

Great, thanks for the URL, I'll check that.

I was wondering if maybe Tika would be an overkill solution to my specific 
case. We don't index PDF, DOC or anything like that, just plain HTML.

I mean, if everything Tika does is to extract text from HTML, maybe I could get 
the same result using Nokogiri directly in Ruby and send it as plain text to 
Solr? Am I missing something? What would Tika do besides extracting the text 
from the HTML?

Thanks in advance,
Rodrigo.

Em 22-09-2016 12:11, Erick Erickson escreveu:
> Tika was upgraded from 1.7 to 1.13 in Solr 6.2 so this is likely a 
> change in Tika.
>
> You could _try_ downgrading Tika, but that's chancy and I have no 
> guarantee that it'll work.
>
> Or use a SolrJ client to use an older version of Tika and transmit it 
> to Solr, here's an example:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Sep 22, 2016 at 8:01 AM, Rodrigo Rosenfeld Rosas 
> <rr_ro...@yahoo.com.br.invalid> wrote:
>> I forgot to mention that this problem just happened after I upgraded 
>> to a recent version of Solr and tried to reindex all documents. Some 
>> documents that had previously succeeded now failed with this error.
>>
>> Em 22-09-2016 11:58, Rodrigo Rosenfeld Rosas escreveu:
>>> Hi, thanks. I was talking to @elyograg over freenode#solr and he (or 
>>> she, can't know by the nickname) recommended me to create a Java app 
>>> integrating SolrJ and Tika to perform the indexing. Is this the only 
>>> way to achieve that with Solr? Since I'm not usually a Java 
>>> developer, I'd prefer another kind of solution, but if there isn't, 
>>> I'll have to look at the Java API and examples for SolrJ and Tika to 
>>> achieve that...
>>>
>>> Just wanted to confirm. I'll try to get a sample HTML yielding to 
>>> this problem and attach it to Jira.
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>> Em 22-09-2016 11:48, Allison, Timothy B. escreveu:
>>>> Y, looks like Nick (gagravarr) has answered on SO -- can't do it in 
>>>> Tika currently.
>>>>
>>>> -Original Message-
>>>> From: Allison, Timothy B. [mailto:talli...@mitre.org]
>>>> Sent: Thursday, September 22, 2016 10:42 AM
>>>> To: solr-user@lucene.apache.org
>>>> Cc: 'u...@tika.apache.org' <u...@tika.apache.org>
>>>> Subject: RE: Disabling Zip bomb detection in Tika
>>>>
>>>> I don't think that's configurable at the moment.
>>>>
>>>> Tika-colleagues, any recommendations?
>>>>
>>>> If you're able to share the file on Tika's jira, we'd be happy to 
>>>> take a look.  You shouldn't be getting the zip bomb unless there is 
>>>> a mismatch between opening and closing tags (which could point to a bug in 
>>>> Tika).
>>>>
>>>> -Original Message-
>>>> From: Rodrigo Rosenfeld Rosas 
>>>> [mailto:rr_ro...@yahoo.com.br.INVALID]
>>>> Sent: Thursday, September 22, 2016 10:06 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Disabling Zip bomb detection in Tika
>>>>
>>>> Hi, this is my first message in this list.
>>>>
>>>> Is it possible to disable Zip bomb detection in the Tika handler?
>>>>
>>>> I've also described the problem here:
>>>>
>>>>
>>>> http://stackoverflow.com/questions/39628519/how-to-disable-or-incre
>>>> ase-limit-zip-bomb-detection-in-tika-with-solr-config?noredirect=1#
>>>> comment66575342_39628519
>>>>
>>>> Basically, I get this error when trying to process some big valid 
>>>> HTML
>>>> documents:
>>>>
>>>> RSolr::Error::Http - 500 Internal Server Error
>>>> Error:
>>>>
>>>> {'responseHeader'=>{'status'=>500,'QTime'=>76},'error'=>{'metadata'=>['error-class','org.apache.solr.common.SolrException','root-error-class','org.apache

RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.

Great!  Tika 1.14 is around the corner...if this is an easy fix ... :)

Thank you.



RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Y, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika 
currently.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, September 22, 2016 10:42 AM
To: solr-user@lucene.apache.org
Cc: 'u...@tika.apache.org' <u...@tika.apache.org>
Subject: RE: Disabling Zip bomb detection in Tika

I don't think that's configurable at the moment.  

Tika-colleagues, any recommendations?

If you're able to share the file on Tika's jira, we'd be happy to take a look.  
You shouldn't be getting the zip bomb unless there is a mismatch between 
opening and closing tags (which could point to a bug in Tika).

-Original Message-
From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID] 
Sent: Thursday, September 22, 2016 10:06 AM
To: solr-user@lucene.apache.org
Subject: Disabling Zip bomb detection in Tika

Hi, this is my first message in this list.

Is it possible to disable Zip bomb detection in the Tika handler?

I've also described the problem here:

http://stackoverflow.com/questions/39628519/how-to-disable-or-increase-limit-zip-bomb-detection-in-tika-with-solr-config?noredirect=1#comment66575342_39628519

Basically, I get this error when trying to process some big valid HTML
documents:

RSolr::Error::Http - 500 Internal Server Error
Error: 
{'responseHeader'=>{'status'=>500,'QTime'=>76},'error'=>{'metadata'=>['error-class','org.apache.solr.common.SolrException','root-error-class','org.apache.tika.sax.SecureContentHandler$SecureSAXException'],'msg'=>'org.apache.tika.exception.TikaException:
 
Zip bomb detected!','trace'=>'org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Zip bomb detected!
 at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
 at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)
 at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)

I need to index those documents. Is it possible to disable Zip bomb detection 
or to increase the limit using configuration files? I noticed it's possible to 
add a tika.config file but I have no idea on how to specify what I want in such 
Tika configuration files.

Any help is appreciated!

Thanks in advance,
Rodrigo.


RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
I don't think that's configurable at the moment.  

Tika-colleagues, any recommendations?

If you're able to share the file on Tika's jira, we'd be happy to take a look.  
You shouldn't be getting the zip bomb unless there is a mismatch between 
opening and closing tags (which could point to a bug in Tika).

-Original Message-
From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID] 
Sent: Thursday, September 22, 2016 10:06 AM
To: solr-user@lucene.apache.org
Subject: Disabling Zip bomb detection in Tika

Hi, this is my first message in this list.

Is it possible to disable Zip bomb detection in the Tika handler?

I've also described the problem here:

http://stackoverflow.com/questions/39628519/how-to-disable-or-increase-limit-zip-bomb-detection-in-tika-with-solr-config?noredirect=1#comment66575342_39628519

Basically, I get this error when trying to process some big valid HTML
documents:

RSolr::Error::Http - 500 Internal Server Error
Error: 
{'responseHeader'=>{'status'=>500,'QTime'=>76},'error'=>{'metadata'=>['error-class','org.apache.solr.common.SolrException','root-error-class','org.apache.tika.sax.SecureContentHandler$SecureSAXException'],'msg'=>'org.apache.tika.exception.TikaException:
 
Zip bomb detected!','trace'=>'org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Zip bomb detected!
 at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
 at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)
 at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)

I need to index those documents. Is it possible to disable Zip bomb detection 
or to increase the limit using configuration files? I noticed it's possible to 
add a tika.config file but I have no idea on how to specify what I want in such 
Tika configuration files.

Any help is appreciated!

Thanks in advance,
Rodrigo.


RE: Solr 6.1 :: language specific analysis

2016-08-10 Thread Allison, Timothy B.
ICU normalization (ICUFoldingFilterFactory) will at least handle "ß" -> "ss" 
(IIRC) and some other language-general variants that might get you close.  
There are, of course, language specific analyzers 
(https://wiki.apache.org/solr/LanguageAnalysis#German) , but I don't think 
they'll get you Foto->photo.  

You might experiment with DoubleMetaphone encoding 
(DoubleMetaphoneFilterFactory) or, worst case, back off to synonym lists 
(SynonymFilterFactory) for your domain.
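
A sketch of a field type combining those pieces (ICUFoldingFilterFactory needs the analysis-extras contrib on the classpath; the name and the inject flag are just an example, and you'd want to test whether the phonetic codes actually conflate your Foto/photo cases):

<fieldType name="text_de_loose" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- case folding, accent folding, ss for eszett -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- keeps the original token and injects its double-metaphone code -->
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"/>
  </analyzer>
</fieldType>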

-Original Message-
From: Rainer Gnan [mailto:rainer.g...@bsb-muenchen.de] 
Sent: Wednesday, August 10, 2016 10:21 AM
To: solr-user@lucene.apache.org
Subject: Solr 6.1 :: language specific analysis

Hello,

I wonder if Solr offers a feature (class) to handle different orthography
versions?
For the German language, for example ... in order to find the same documents
when searching for "Foto" or "Photo".

I appreciate any help!

Rainer



Rainer Gnan
Bayerische Staatsbibliothek 
BibliotheksVerbund Bayern
Verbundnahe Dienste
80539 München
Tel.: +49(0)89/28638-2445
Fax: +49(0)89/28638-2665
E-Mail: rainer.g...@bsb-muenchen.de






RE: Automatic Language Identification

2016-07-01 Thread Allison, Timothy B.
+1 to langdetect

In Tika 2.0, we're going to remove our own language detection code and allow 
users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl 
library or Yalder (https://github.com/kkrugler/yalder).  The first two are now 
available in Tika 1.13.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, June 22, 2016 8:27 AM
To: solr-user@lucene.apache.org; solr-user 
Subject: RE: Automatic Language Identification

Hello,

I recommend using the langdetect language detector, it supports many more 
languages and has much higher precission than Tika's detector.

Markus
 
 


RE: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-21 Thread Allison, Timothy B.
Not that I need any other book beyond this one... but I didn't realize that the 
50% discount code applies to all books in the order. :)

Congratulations, Doug and John!

-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com] 
Sent: Tuesday, June 21, 2016 2:12 PM
To: solr-user@lucene.apache.org
Cc: John Berryman 
Subject: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

Not much more to add than my post here! This book is targeted towards 
Lucene-based search (Elasticsearch and Solr) relevance.

Announcement with discount code:
http://opensourceconnections.com/blog/2016/06/21/relevant-search-published/

Related hacker news thread:
https://news.ycombinator.com/item?id=11946636

Thanks to everyone in the Solr community that was helpful to my efforts.
Specifically Trey Grainger, Eric Pugh (for keeping me employed), Charlie Hull 
and the Flax team, Alex Rafalovitch, Timothy Potter, Yonik Seeley, Grant 
Ingersoll (for basically teaching me Solr back in the day), Drew Farris (for 
encouraging my early blogging), everyone at OSC, and many others I'm probably 
forgetting!

Best
-Doug


RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
>Awesome, 0 pre and 1 post works!

Great!

> What if I wanted to match thirty, but exclude if six or seven are included 
> anywhere in the document?

Any time you need "anywhere in the document", use a "regular" query (not 
SpanQuery).  As you wrote initially, you can construct a BooleanQuery that 
includes a complex SpanQuery and another Query that is 
BooleanClause.Occur.MUST_NOT.
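
For your example (exclude documents containing "six" or "seven" anywhere), that combination might look roughly like this; "field" and the complexSpanQuery variable are placeholders for whatever you've already built:

    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(complexSpanQuery, BooleanClause.Occur.MUST);  // the proximity part
    b.add(new TermQuery(new Term("field", "six")), BooleanClause.Occur.MUST_NOT);
    b.add(new TermQuery(new Term("field", "seven")), BooleanClause.Occur.MUST_NOT);
    Query q = b.build();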

> I also tried 0 pre and 0 post
You'd use those if you wanted to find something that didn't contain something 
else: 

["William Clinton"~2 Jefferson]!~0,0

Find 'william' within two words of 'clinton', but not if 'jefferson' appears 
between them.

> I replaced pre with Integer.MAX_VALUE and post with Integer.MAX_VALUE - 5 and 
> it works!
I'll have to think about this one...



RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
>Perhaps I'm misunderstanding the pre/post parameters?

Pre/post parameters: "'six' or 'seven' should not appear $pre tokens before
'thirty' or $post tokens after 'thirty'"

Maybe something like this:
spanNear([
  spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
  spanNot(field:thirty, spanOr([field:six, field:seven]), 0, 1)
  ], 0, true)



RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
> dtSearch allows a user to have NOTs embedded in proximity searches.

And, if you're heading down the path of building your own queryparser to handle 
dtSearch's syntax, please read and heed Charlie Hull's post:

http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

See also:

http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/
 



RE: SpanQuery - How to wrap a NOT subquery

2016-06-21 Thread Allison, Timothy B.
In the syntax for LUCENE-5205’s SpanQueryParser 
[0], that’d be

[“one thousand one hundred thirty” (six seven)]!~0,1

In English: find “one thousand one hundred thirty”, but not if six or seven 
comes immediately after it.

[0] https://github.com/tballison/lucene-addons/tree/master/lucene-5205

From: Brandon Miller [mailto:computerengineer.bran...@gmail.com]
Sent: Monday, June 20, 2016 4:12 PM
To: Allison, Timothy B. <talli...@mitre.org>; solr-user@lucene.apache.org
Subject: Re: SpanQuery - How to wrap a NOT subquery

Thank you, Timothy.

I have support for and am using SpanNotQuery elsewhere.  Maybe there is another 
use for it that I'm not considering.  I'm wondering if there's a clever way of 
reusing it in order to satisfy the requirements of proximity NOTs, too.

dtSearch allows a user to have NOTs embedded in proximity searches.
I.e.
Let's say you have an index whose ID has been converted to English phrases, 
like 1001 would be "One thousand one"

"one thousand one hundred" pre/0 (thirty and not (six or seven))
Returns: 1130, 1131, 1132, 1133, 1134, 1135,1138, 1139

Perhaps I've been staring at the screen too long and the obvious answer is 
hiding from me.

Here's how I'm trying to implement it, but it's incorrect...  It's giving me 
1130..1139 without excluding anything.



public Query visitNot_expr(Not_exprContext ctx) {
    //ProximityNotSupportedFor("NOT");
    Query subquery = visit(ctx.expr());
    BooleanQuery.Builder query = new BooleanQuery.Builder();
    query.add(subquery, BooleanClause.Occur.MUST_NOT);
    // TODO: Consolidate this so that we don't use MatchAllDocsQuery, but using the other query, to increase performance
    query.add(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD);

    if (currentlyInASpanQuery) {
        SpanQuery matchAllDocs = getSpanWildcardQuery(new Term(defaultFieldName, "*"));
        SpanNotQuery snq = new SpanNotQuery(matchAllDocs, (SpanQuery) subquery, Integer.MAX_VALUE, Integer.MAX_VALUE);
        return snq;
    } else {
        return query.build();
    }
}

protected SpanQuery getSpanWildcardQuery(Term term) {
    WildcardQuery wq = new WildcardQuery(term);
    SpanQuery swq = new SpanMultiTermQueryWrapper<>(wq);
    return swq;
}


On Mon, Jun 20, 2016 at 2:53 PM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
Bouncing over to user’s list.

As you’ve found, spans are different from regular queries.  MUST_NOT at the 
BooleanQuery level means that the term must not appear anywhere in the 
document; whereas spans focus on terms near each other.

Have you tried SpanNotQuery?  This would allow you at least to do something 
like:

termA but not if zyx or yyy appears X words before or Y words after



From: Brandon Miller 
[mailto:computerengineer.bran...@gmail.com]
Sent: Monday, June 20, 2016 2:36 PM
To: d...@lucene.apache.org
Subject: SpanQuery - How to wrap a NOT subquery

Greetings!

I'm wanting to support this:
TermA within_N_terms_of (abc and cba or xyz and not zyx or not yyy)

Focusing on the sub-query:
I have ANDs and ORs figured out (special tricks playing with slops and such).

I'm having the hardest time figuring out how to wrap a NOT.

Outside of SpanQuery, I'm using a BooleanQuery with a MUST_NOT clause.  That's 
fine (if you know another way, I'd like to hear that, too, but this appears to 
work dandy).

However, SpanQuery requires subqueries that are also of type SpanQuery, or
SpanMultiTermQueryWrapper will allow you to throw in anything derived from
MultiTermQuery (which includes AutomatonQuery).

Right now, I'm at a loss.  We have huge, complex, nested boolean queries inside 
proximity operators with our current solution.

If I need to write a custom solution, then that's what I need to hear and 
perhaps a couple of pointers.

Thanks a bunch and God bless!

Brandon



Morphlines.cell and attachments in complex docs?

2016-06-17 Thread Allison, Timothy B.
I was just looking at SolrCellBuilder, and it looks like there's an assumption 
that documents will not have attachments/embedded objects.  Unless I 
misunderstand the code, users will not be able to search documents inside zips, 
or attachments in msg/ doc/pdf/etc (cf. SOLR-7189).

Are embedded documents extracted in a step before hitting SolrCellBuilder?

Bug or feature?

Thank you!

 Cheers,

Tim



RE: Bypassing ExtractingRequestHandler

2016-06-13 Thread Allison, Timothy B.



>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should 
>be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/

+1

> We tend to prefer running Tika externally as it's entirely possible 
> that Tika will crash or hang with certain files - and that will bring 
> down Solr if you're running Tika within it.

+1
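A bare-bones sketch of that pattern, roughly along the lines of the searchhub post (core URL and field names are made up; no per-file error handling, which you'd absolutely want in production):

import java.io.InputStream;
import java.nio.file.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
            AutoDetectParser parser = new AutoDetectParser();
            for (String arg : args) {
                Path p = Paths.get(arg);
                BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
                Metadata metadata = new Metadata();
                try (InputStream is = Files.newInputStream(p)) {
                    parser.parse(is, handler, metadata);   // Tika runs here, outside Solr
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", p.toAbsolutePath().toString());
                doc.addField("body_txt_en", handler.toString());
                solr.add(doc);
            }
            solr.commit();
        }
    }
}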

>> I want to make a small modification 
>> to Tika to get and save additional data from my PDFs
What info do you need, and if it is common enough, could you ask over on Tika's 
JIRA and we'll try to add it directly?





RE: find stores with sales of > $x in last 2 months ?

2016-06-06 Thread Allison, Timothy B.
Thank you, Alex.

> Sorry, your question a bit confusing.
Y. Sorry.

> Also, is this last month as in 'January' (rolling monthly) or as in 'last 30 
> days' (rolling daily).

Ideally, the latter, if this is possible to calculate dynamically in response 
to a query.  My backoff method (if the 'rolling daily' method isn't possible), 
would be to index monthly stats and then just use the range query as you 
suggested.
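For the backoff approach, the query-time side is then just a range filter on whichever stats field gets updated, e.g. (field name made up):

fq=sales_last_60_days:[10000 TO *]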

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Sunday, June 5, 2016 12:52 AM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: find stores with sales of > $x in last 2 months ?

Are you asking for just numerical comparison during search or about a way to 
aggregate numbers from multiple records? Also, is this last month as in 
'January' (rolling monthly) or as in 'last 30 days'
(rolling daily). Sorry, your question a bit confusing.

Numerical comparison is just a range (numField:[x TO *])  as per

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-RangeSearches

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-DifferencesbetweenLuceneQueryParserandtheSolrStandardQueryParser

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 3 June 2016 at 23:23, Allison, Timothy B. <talli...@mitre.org> wrote:
> All,
>   This is a toy example, but is there a way to search for, say, stores with 
> sales of > $x in the last 2 months with Solr?
>   $x and the time frame are selected by the user at query time.
>
> If the queries could be constrained (this is still tbd), I could see updating 
> "stats" fields within each store document on a daily basis 
> (sales_last_1_month, sales_last_2_months, sales_last_3_months...etc).  The 
> dataset is fairly small and daily updates of this nature would not be 
> prohibitive.
>
>Or, is this trying to use a screw driver where a hammer is required?
>
>Thank you.
>
>Best,
>
>  Tim


find stores with sales of > $x in last 2 months ?

2016-06-03 Thread Allison, Timothy B.
All,
  This is a toy example, but is there a way to search for, say, stores with 
sales of > $x in the last 2 months with Solr?
  $x and the time frame are selected by the user at query time.  

If the queries could be constrained (this is still tbd), I could see updating 
"stats" fields within each store document on a daily basis (sales_last_1_month, 
sales_last_2_months, sales_last_3_months...etc).  The dataset is fairly small 
and daily updates of this nature would not be prohibitive.

   Or, is this trying to use a screw driver where a hammer is required?
 
   Thank you.

   Best,

 Tim


RE: Metadata and HTML ending up in searchable text

2016-05-31 Thread Allison, Timothy B.
>>  From the same page, extractFormat=text only applies when extractOnly 
>> is true, which just shows the output from tika without indexing the document.

Y, sorry.  I just looked through the source code.  You're right.  If you use 
DIH (TikaEntityProcessor) instead of Solr Cell (ExtractingDocumentLoader), you 
should be able to set the handler type by setting the "format" attribute, and 
"text" is one option there.

>>I just want to make sure I'm not missing something really obvious before 
>>submitting a bug report.
I don't think you are.
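For the DIH route, roughly something like this in the data config (an untested sketch; the path and field names are placeholders):

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="doc" processor="TikaEntityProcessor"
            url="/path/to/file.html" format="text">
      <field column="text" name="body_txt_en"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>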

>>  From the same page, extractFormat=text only applies when extractOnly 
>> is true, which just shows the output from tika without indexing the document.
>> Running it in "extractOnly" mode resulting in a XML output. The 
>> difference between selecting "text" or "xml" format is that the 
>> escaped document in the  tag is either the original HTML 
>> (xml mode) or stripped HTML (text mode). It seems some Javascript 
>> creeps into the text version. (See below)
>>
>> Regards,
>> Simon
>>
>> HTML mode sample:
>>   > name="responseHeader">0> name="QTime">51> name="UsingMailingLists.html">?xml
>> version="1.0" encoding="UTF-8"?
>> html xmlns="http://www.w3.org/1999/xhtml";
>> head
>> link
>>  rel="stylesheet" type="text/css" charset="utf-8" media="all"
>> href="/wiki/modernized/css/common.css"/
>>  link rel="stylesheet" type="text/css" charset="utf-8"
>>  media="screen" href="/wiki/modernized/css/screen.css"/
>>  link rel="stylesheet" type="text/css" charset="utf-8"
>>  media="print" href="/wiki/modernized/css/print.css"/...
>>
>> TEXT mode (Blank lines stripped):
>> 
>> 0> name="QTime">47 
>> UsingMailingLists - Solr Wiki
>> Search:
>> !--// Initialize search form
>> var f = document.getElementById('searchform');
>> f.getElementsByTagName('label')[0].style.display = 'none'; var e = 
>> document.getElementById('searchinput');
>> searchChange(e);
>> searchBlur(e);
>> //--
>> Solr Wiki
>> Login
>>
>>
>>
>>
>>
>>
>> On 27/05/16 13:31, Allison, Timothy B. wrote:
>>> I'm only minimally familiar with Solr Cell, but...
>>>
>>> 1) It looks like you aren't setting extractFormat=text.  According 
>>> to [0]...the default is xhtml which will include a bunch of the metadata.
>>> 2) is there an attr_* dynamic field in your index with type="ignored"?
>>> This would strip out the attr_ fields so they wouldn't even be 
>>> indexed...if you don't want them.
>>>
>>> As for the HTML file, it looks like Tika is failing to strip out the 
>>> style section.  Try running the file alone with tika-app: java -jar 
>>> tika-app.jar -t inputfile.html.  If you are finding the noise there.  
>>> Please open an issue on our JIRA: 
>>> https://issues.apache.org/jira/browse/tika
>>>
>>>
>>> [0]
>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with
>>> +Solr+Cell+using+Apache+Tika
>>>
>>>
>>> -Original Message-
>>> From: Simon Blandford [mailto:simon.blandf...@bkconnect.net]
>>> Sent: Thursday, May 26, 2016 9:49 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Metadata and HTML ending up in searchable text
>>>
>>> Hi,
>>>
>>> I am using Solr 6.0 on Ubuntu 14.04.
>>>
>>> I am ending up with loads of junk in the text body. It starts like,
>>>
>>> The JSON entry output of a search result shows the indexed text 
>>> starting with...
>>> body_txt_en: " stream_size 36499 X-Parsed-By 
>>> org.apache.tika.parser.DefaultParser X-Parsed-By"
>>>
>>> And then once it gets to the actual text I get CSS class names 
>>> appearing that were in  or  tags etc.
>>> e.g. "the power of calibre3 silence calibre2 and", where 
>>> "calibre3" etc are the CSS class names.
>>>
>>> All this junk is searchable and is polluting the index.
>>>
>>> I would like to index _only_ the actual content I am interested in 
>>> searching for.
>>>
>>> Steps to reproduce:
>>>
>>> 1) So

RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
Of course, for greater control over indexing (and for more robust handling of 
exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ:

http://searchhub.org/2012/02/14/indexing-with-solrj/

-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net] 
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like,

The JSON entry output of a search result shows the indexed text starting with...
body_txt_en: " stream_size 36499 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By"

And then once it gets to the actual text I get CSS class names appearing that 
were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where "calibre3" etc 
are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching 
for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document index using the following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" 
-F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document index using following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" 
-F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: 
http://localhost:8983/solr/mycore/select?q=especially=json

Result:

For the txt file, I get the following JSON for the document...

{
 id: "doc1",
 attr_stream_size: [
 "8107"
 ],
 attr_x_parsed_by: [
 "org.apache.tika.parser.DefaultParser",
 "org.apache.tika.parser.txt.TXTParser"
 ],
 attr_stream_content_type: [
 "text/plain"
 ],
 attr_stream_name: [
 "UsingMailingLists.txt"
 ],
 attr_stream_source_info: [
 "content/UsingMailingLists.txt"
 ],
 attr_content_encoding: [
 "ISO-8859-1"
 ],
 attr_content_type: [
 "text/plain; charset=ISO-8859-1"
 ],
 body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By 
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name 
UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt 
Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: 
[value ] [Titles] [Text] Solr_Wiki Login ** UsingMailingLists ** * 
FrontPage * RecentChanges...etc",
_version_: 1535398235801124900
}

For the HTML file,  I get the following JSON for the document...

{
 id: "doc2",
 attr_stream_size: [
 "20440"
 ],
 attr_x_parsed_by: [
 "org.apache.tika.parser.DefaultParser",
 "org.apache.tika.parser.html.HtmlParser"
 ],
 attr_stream_content_type: [
 "text/html"
 ],
 attr_stream_name: [
 "UsingMailingLists.html"
 ],
 attr_stream_source_info: [
 "content/UsingMailingLists.html"
 ],
 attr_dc_title: [
 "UsingMailingLists - Solr Wiki"
 ],
 attr_content_encoding: [
 "UTF-8"
 ],
 attr_robots: [
 "index,nofollow"
 ],
 attr_title: [
 "UsingMailingLists - Solr Wiki"
 ],
 attr_content_type: [
 "text/html; charset=utf-8"
 ],
 body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki: 
UsingMailingLists
/solr/UsingMailingLists?diffs=1_att=1=rss_rc=0=UsingMailingLists=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type 
text/html stream_name UsingMailingLists.html stream_source_info...etc",
 _version_: 1535398408383103000
}





RE: Metadata and HTML ending up in searchable text

2016-05-27 Thread Allison, Timothy B.
I'm only minimally familiar with Solr Cell, but...

1) It looks like you aren't setting extractFormat=text.  According to [0]...the 
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"?  This 
would strip out the attr_ fields so they wouldn't even be indexed...if you 
don't want them.
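For (2), the usual mapping looks like this in schema.xml (a sketch; adjust to your own schema):

<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="attr_*" type="ignored"/>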

As for the HTML file, it looks like Tika is failing to strip out the style 
section.  Try running the file alone with tika-app: java -jar tika-app.jar -t 
inputfile.html.  If you are finding the noise there.  Please open an issue on 
our JIRA: https://issues.apache.org/jira/browse/tika


[0] 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika


-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.net] 
Sent: Thursday, May 26, 2016 9:49 AM
To: solr-user@lucene.apache.org
Subject: Metadata and HTML ending up in searchable text

Hi,

I am using Solr 6.0 on Ubuntu 14.04.

I am ending up with loads of junk in the text body. It starts like,

The JSON entry output of a search result shows the indexed text starting with...
body_txt_en: " stream_size 36499 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By"

And then once it gets to the actual text I get CSS class names appearing that 
were in  or  tags etc.
e.g. "the power of calibre3 silence calibre2 and", where "calibre3" etc 
are the CSS class names.

All this junk is searchable and is polluting the index.

I would like to index _only_ the actual content I am interested in searching 
for.

Steps to reproduce:

1) Solr installed by untaring solr tgz in /opt.

2) Core created by typing "bin/solr create -c mycore"

3) Solr started with bin/solr start

4) TXT document index using the following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=body_txt_en&commit=true" 
-F "content/UsingMailingLists.txt=@/home/user/Documents/library/UsingMailingLists.txt"

5) HTML document index using following command curl 
"http://localhost:8983/solr/mycore/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=body_txt_en&commit=true" 
-F "content/UsingMailingLists.html=@/home/user/Documents/library/UsingMailingLists.html"

6) Query using URL: 
http://localhost:8983/solr/mycore/select?q=especially=json

Result:

For the txt file, I get the following JSON for the document...

{
 id: "doc1",
 attr_stream_size: [
 "8107"
 ],
 attr_x_parsed_by: [
 "org.apache.tika.parser.DefaultParser",
 "org.apache.tika.parser.txt.TXTParser"
 ],
 attr_stream_content_type: [
 "text/plain"
 ],
 attr_stream_name: [
 "UsingMailingLists.txt"
 ],
 attr_stream_source_info: [
 "content/UsingMailingLists.txt"
 ],
 attr_content_encoding: [
 "ISO-8859-1"
 ],
 attr_content_type: [
 "text/plain; charset=ISO-8859-1"
 ],
 body_txt_en: " stream_size 8107 X-Parsed-By 
org.apache.tika.parser.DefaultParser X-Parsed-By 
org.apache.tika.parser.txt.TXTParser stream_content_type text/plain stream_name 
UsingMailingLists.txt stream_source_info content/UsingMailingLists.txt 
Content-Encoding ISO-8859-1 Content-Type text/plain; charset=ISO-8859-1 Search: 
[value ] [Titles] [Text] Solr_Wiki Login ** UsingMailingLists ** * 
FrontPage * RecentChanges...etc",
_version_: 1535398235801124900
}

For the HTML file,  I get the following JSON for the document...

{
 id: "doc2",
 attr_stream_size: [
 "20440"
 ],
 attr_x_parsed_by: [
 "org.apache.tika.parser.DefaultParser",
 "org.apache.tika.parser.html.HtmlParser"
 ],
 attr_stream_content_type: [
 "text/html"
 ],
 attr_stream_name: [
 "UsingMailingLists.html"
 ],
 attr_stream_source_info: [
 "content/UsingMailingLists.html"
 ],
 attr_dc_title: [
 "UsingMailingLists - Solr Wiki"
 ],
 attr_content_encoding: [
 "UTF-8"
 ],
 attr_robots: [
 "index,nofollow"
 ],
 attr_title: [
 "UsingMailingLists - Solr Wiki"
 ],
 attr_content_type: [
 "text/html; charset=utf-8"
 ],
 body_txt_en: " stylesheet text/css utf-8 all 
/wiki/modernized/css/common.css stylesheet text/css utf-8 screen 
/wiki/modernized/css/screen.css stylesheet text/css utf-8 print 
/wiki/modernized/css/print.css stylesheet text/css utf-8 projection 
/wiki/modernized/css/projection.css alternate Solr Wiki: 
UsingMailingLists
/solr/UsingMailingLists?diffs=1_att=1=rss_rc=0=UsingMailingLists=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw Alternate print Print View 
/solr/UsingMailingLists?action=print Search /solr/FindPage Index 
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting 
stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By 

RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
>...and I've just blogged about some of the issues one can run into with this 
>sort of project, hope this is useful!
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

+1 completely non-trivial task to roll your own.

I'd add that incorporating multiterm analysis (analysis/normalization of 
wildcard, fuzzy, prefix, regex, etc) is a fundamental requirement too often 
overlooked.  If you don't do this correctly, you'll get results, but not all 
that you should be getting -- you won't know what you can't find. :)  

It would be great if Uwe could add a check for improperly ignoring 
normalization of multiterms to his forbiddenapis. :)


RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
Depending on your needs, you might want to take a look at my SpanQueryParser 
(LUCENE-5205/SOLR-5410).  It does not offer dtsearch syntax, but if the 
SurroundQueryParser was close enough, this parser may be of use.  If you need 
modifications to it, let me know.  I'm in the process of adding 
SpanPositionRangeQuery syntax.

If you need to roll your own, beware, it is not a trivial task.  The 
SimpleQueryParser might offer the cleanest example to build on top of.

Working versions of LUCENE-5205/SpanQueryParser are available on my github 
site.  If you are using Lucene/Solr 5.5, for example, go to this branch:

https://github.com/tballison/lucene-addons/tree/lucene5.5-0.1



-Original Message-


From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Friday, May 13, 2016 5:41 AM
To: solr-user@lucene.apache.org
Subject: Re: dtSearch parser & Introduction

On 12/05/2016 23:50, Brandon Miller wrote:
> Hello, all!  I'm a BloombergBNA employee and need to obtain/write a 
> dtSearch parser for solr (and probably a bunch of other things a 
> little later).
> I've looked at the available parsers and thought that the surround 
> parser may do the trick, but it apparently doesn't like nested N or W 
> subqueries.
> I looked at XmlQueryParser and I'm most impressed with it from a 
> functionality perspective.  I liked the SpanQueries, but I either 
> don't understand SpanNot or it has a bug for the exclude.
> At the end of the day, we will need to continue to support dtSearch 
> syntax.  I may as well just bite the bullet and write the dtSearch 
> parser and include it as a patch for Solr.

Hi Brandon,

We have a version of a dtSearch/Lucene query parser written a few years
ago: 
http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/

It would need some work to bring it up to date with the latest version of Solr 
(which is why we're not offering it for download any more), but it would save 
you a lot of time. We've also built parsers for Verity's query language and 
some others - just so you're warned, writing parsers isn't an easy task for a 
beginner, often to support what looks like a simple query in your old language 
can involve some quite complex work on the Lucene side.

Best

Charlie

>
> Here are my immediate issues:
>- I don't know the best path forward on making the parser (I saw 
> something in the HowToContribute page at the bottom about JFlex)  -  
> Can someone please take pity on me and help me get started down this 
> path?  I probably won't need a lot of help.
>- I'm great at .NET, not so much Java--yet.  I've not yet been able 
> to build a trunk and "deploy" it (I can build it and run tests, but 
> not run it--I'm sure I'm just missing an elusive documentation link on 
> how to do
> that)
>- I downloaded and got the solr trunk in Eclipse.  I'm not sure the 
> best way of adding unit tests for my stuff--do I add it to an existing 
> subdirectory or create a new package?
>
> I think it'd be great if I could get a bare-bones example of a parser 
> so that I can modify it--perhaps even keeping it in a separate Java project.
>
> Don't feel like you have to answer all of my questions--an answer to 
> any of them would be quite helpful.
>
> Thank you guys and God bless!
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: Indexing a (File attached to a document)

2016-05-12 Thread Allison, Timothy B.
If I understand the question correctly...

I'm assuming you are indexing rich documents (PDF/DOC/MSG, etc) with DIH's Tika 
handler.  Some of those documents have attachments.

If that's the case, all of the content of embedded docs _should_[0] be 
extracted, but then all of that content across the main document and the 
embedded documents is concatenated into one big string.

If you want to handle attachments with greater precision, the best bet is using 
SolrJ [1] in combination with Tika's RecursiveParserWrapper [2].  That wrapper 
returns a list of Metadata objects for each input file.  The list contains one 
Metadata object for each "document" (one for the container and one for each 
attachment).

So, if I'm right, and you'd like this as part of Solr's DIH, see [3].
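FWIW, a bare-bones sketch of [2] against the Tika 1.x API (the input file name is just a placeholder):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class RecursiveExtract {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(Paths.get("input.pdf"))) {
            wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
        }
        // one Metadata per "document": the first is the container, the rest are the attachments
        List<Metadata> metadataList = wrapper.getMetadata();
        for (Metadata m : metadataList) {
            System.out.println(m.get("resourceName") + " -> " + m.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
    }
}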


[0] https://issues.apache.org/jira/browse/SOLR-7189
[1] https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

[2] 
http://stackoverflow.com/questions/36950382/how-to-extract-content-from-pst-file-using-apache-tika
 

[3] https://issues.apache.org/jira/browse/SOLR-7229 
-Original Message-
From: Reth RM [mailto:reth.ik...@gmail.com] 
Sent: Thursday, May 12, 2016 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing a (File attached to a document)

Could you please let us know which crawler you are using to fetch data from 
the document and its attachment?


On Thu, May 12, 2016 at 3:26 PM, Solr User  wrote:

> Hi
>
> If I index a document with a file attachment attached to it in Solr, 
> can I also visualise the data of that attachment while 
> querying that particular document? Please help me on this
>
>
> Thanks & Regards
> Vidya Nadella
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-a-File-attached-to-a-document-tp4276334.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y, integrating Tika is non-trivial.  I think Uwe adds the dependencies by hand 
with great care, looking at the dependency tree in Maven and making sure there 
weren't any conflicts.


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, May 4, 2016 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

On 5/4/2016 9:21 AM, Betsey Benagh wrote:
> I’m feeling particularly dense, because I don’t see any Tika jars in
> WEB-INF/lib:

Oops. Sorry about that, I forgot that it's all contrib.  That's my mistake, not 
yours.

The Tika jars are in contrib/extraction/lib, along with a very large number of 
dependencies.

It turns out that I probably have no idea what I'm talking about.  I cannot 
find any version 1.12 downloads on Tika's website that are structured the same 
way as what's in our contrib directory, so I have no idea how to actually do 
the manual upgrade.

I seem to remember hearing about people doing a Tika upgrade manually, but I've 
got no idea how they did it.

Thanks,
Shawn



RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
Y. Solr 6.0.0 is shipping with Tika 1.7.  Grobid came in with Tika 1.11.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr

I think Solr is using a version of Tika that predates the addition of the 
Grobid parser.  You'll have to add that manually somehow until Solr upgrades to 
Tika 1.13 (soon to be released...I think).  SOLR-8981.

-Original Message-
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com] 
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf

Here's the stack trace.

org.apache.solr.common.SolrException
java.lang.ClassNotFoundException
org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
at 
org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
at 
org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
at 
org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser 
class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:127)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:115)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:111)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:92)
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
... 30 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.tika.parser.journal.JournalParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.ja

RE: Integrating grobid with Tika in solr

2016-05-04 Thread Allison, Timothy B.
I think Solr is using a version of Tika that predates the addition of the 
Grobid parser.  You'll have to add that manually somehow until Solr upgrades to 
Tika 1.13 (soon to be released...I think).  SOLR-8981.

-Original Message-
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com] 
Sent: Wednesday, May 4, 2016 10:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Integrating grobid with Tika in solr

Grobid runs as a service, and I'm (theoretically) configuring Tika to call it.

From the Grobid wiki, here are instructions for integrating with Tika 
application:

First we need to create the GrobidExtractor.properties file that points to the 
Grobid REST Service. My file looks like the following:

grobid.server.url=http://localhost:[port]

Now you can run GROBID via Tika-app with the following command on a sample PDF 
file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI 
--config=$HOME/src/grobidparser-resources/tika-config.xml -J 
$HOME/src/grobid/papers/ICSE06.pdf

Here's the stack trace.

org.apache.solr.common.SolrException
java.lang.ClassNotFoundException
org.apache.tika.exception.TikaException: Unable to find a parser class: org.apache.tika.parser.journal.JournalParser

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unable to find a parser class: org.apache.tika.parser.journal.JournalParser
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:82)
at 
org.apache.solr.core.PluginBag$LazyPluginHolder.createInst(PluginBag.java:367)
at org.apache.solr.core.PluginBag$LazyPluginHolder.get(PluginBag.java:348)
at org.apache.solr.core.PluginBag.get(PluginBag.java:148)
at 
org.apache.solr.handler.RequestHandlerBase.getRequestHandler(RequestHandlerBase.java:231)
at org.apache.solr.core.SolrCore.getRequestHandler(SolrCore.java:1362)
at 
org.apache.solr.servlet.HttpSolrCall.extractHandlerFromURLPath(HttpSolrCall.java:326)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:296)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:412)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:225)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:183)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to find a parser 
class: org.apache.tika.parser.journal.JournalParser
at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:362)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:127)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:115)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:111)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:92)
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:80)
... 30 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.tika.parser.journal.JournalParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method) at 
java.lang.Class.forName(Class.java:348)
at org.apache.tika.config.ServiceLoader.getServiceClass(ServiceLoader.java:189)
at 

RE: Overall large size in Solr across collections

2016-04-26 Thread Allison, Timothy B.
> I can tell you that Tika is  quite the resource hog.  It is likely chewing up 
> CPU and memory 
> resources at an incredible rate, slowing down your Solr server.  You 
> would probably see better performance than ERH if you incorporate Tika 
> and SolrJ into a client indexing program that runs on a different machine 
> than Solr.

+1

It'd be interesting to see what the performance is if you run standalone 
tika-batch.

java -jar tika-app.jar -i <input_directory> -o <output_directory>

and if you're feeling adventurous:

java -jar tika-app.jar -i <input_directory> -o <output_directory> -J -t

You can specify the number of threads with -numConsumers 5 (don't use many more 
than # of cpus!)

Content extraction with Tika is usually slower (sometimes far slower) than the 
indexing step.  If you have any crazily slow docs, open an issue on Tika's JIRA.

Cheers,
 
  Tim



-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Thursday, April 21, 2016 12:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Overall large size in Solr across collections

Hi Shawn,

Yes, I'm using the Extracting Request Handler.

The 0.7GB/hr is the indexing rate measured by the size of the original documents 
that get ingested into Solr. This means that for every hour, only 0.7GB of my 
documents gets ingested into Solr. It will require 10 hours just to index 
documents which total 7GB in size.

Regards,
Edwin




RE: Indexing documents in Solr 5 Using Tika extraction error

2016-03-28 Thread Allison, Timothy B.

> If you're going to use Tika for production indexing, you should write 
> a Java program using SolrJ and Tika so that you are in complete 
> control, and so Solr isn't unstable.

+1

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3cby2pr09mb11210edfcfa297528940b07c7...@by2pr09mb112.namprd09.prod.outlook.com%3E


RE: outlook email file pst extraction problem

2016-03-02 Thread Allison, Timothy B.
This is probably more of a Tika question now...

It sounds like Tika is not extracting dates from the .eml files that you are 
generating?  To confirm, you are able to extract dates with libpst...it is just 
that Tika is not able to process the dates that you are sending it in your .eml 
files?

If you are able to share an .eml file (either via personal email or open a 
ticket on Tika's jira if you think this is a bug in Tika), I can take a look.

-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Monday, February 29, 2016 7:17 PM
To: solr-user@lucene.apache.org
Subject: Re: outlook email file pst extraction problem

Thanks Timothy for your prompt help.

 I tried the first option. I am able to extract .eml (MIME format) files from the PST 
file using the libpst library.
 I am not able to extract .msg (Outlook email) files using the libpst library. I am 
able to feed .eml files into SOLR.
 I can see some tags are missing in the extraction of the .eml files in SOLR. 
In particular, date tags are missing from the .eml file tags compared with the tags 
generated from .msg files. How do I generate date tags with .eml files?
My SOLR program stopped working due to the lack of date tags, while the same program 
worked fine with .msg files. Any suggestion for generating date tags with .eml files?  
Is it a good idea to look at JPST or Aspose (both are 3rd-party libraries to 
extract .msg files from a PST file) for this case?

Advanced Thanks.

--sreenivasa kallu

On Thu, Feb 11, 2016 at 11:55 AM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Should have looked at how we handle psts before earlier responsesorry.
>
> What you're seeing is Tika's default treatment of embedded documents, 
> it concatenates them all into one string.  It'll do the same thing for 
> zip files and other container files.  The default Tika format is 
> xhtml, and we include tags that show you where the attachments are.  
> If the tags are stripped, then you only get a big blob of text, which 
> is often all that's necessary for search.
>
> Before SOLR-7189, you wouldn't have gotten any content, so that's 
> progress...right?
>
> Some options for now:
> 1) use java-libpst as a preprocessing step to extract contents from 
> your psts before you ingest them in Solr (feel free to borrow code 
> from our OutlookPSTParser).
> 2) use tika from the commandline with the -J -t options to get a Json 
> representation of the overall file, which includes a list of maps, 
> where each map represents a single embedded file.  Again, if you have 
> any questions on this, head over to u...@tika.apache.org
>
> I think what you want is something along the lines of SOLR-7229, which 
> would treat each embedded document as its own document.  That issue is 
> not resolved, and there's currently no way of doing this within DIH 
> that I'm aware of.
>
> If others on this list have an interest in SOLR-7229, let me know, and 
> I'll try to find some time.  I'd need feedback on some design decisions.
>
>
>
>
>
> -Original Message-
> From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com]
> Sent: Thursday, February 11, 2016 1:43 PM
> To: solr-user@lucene.apache.org
> Subject: outlook email file pst extraction problem
>
> Hi ,
>I am currently indexing individual outlook messages and 
> searching is working fine.
> I have created solr core using following command.
>  ./solr create -c sreenimsg1 -d data_driven_schema_configs
>
> I am using following command to index individual messages.
> curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"
> -F "myfile=@/home/ec2-user/msg9.msg"
>
> This setup is working fine.
>
> But new requirement is extract messages using outlook pst file.
> I tried following command to extract messages from outlook pst file.
>
> curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"
> -F "myfile=@/home/ec2-user/sateamc_0006.pst"
>
> This command extracts only high-level tags and extracts all 
> messages into one message. I am not getting all the tags I get when extracting 
> individual messages. Is the above command correct? Is the problem that it is not using 
> recursion?
>  How do I add recursion to the above command? Is it a Tika library problem?
>
> Please help to solve above problem.
>
> Advanced Thanks.
>
> --sreenivasa kallu
>


RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Should have looked at how we handle psts before earlier responsesorry.

What you're seeing is Tika's default treatment of embedded documents, it 
concatenates them all into one string.  It'll do the same thing for zip files 
and other container files.  The default Tika format is xhtml, and we include 
tags that show you where the attachments are.  If the tags are stripped, then 
you only get a big blob of text, which is often all that's necessary for search.

Before SOLR-7189, you wouldn't have gotten any content, so that's 
progress...right?

Some options for now:
1) use java-libpst as a preprocessing step to extract contents from your psts 
before you ingest them in Solr (feel free to borrow code from our 
OutlookPSTParser).
2) use tika from the commandline with the -J -t options to get a Json 
representation of the overall file, which includes a list of maps, where each 
map represents a single embedded file.  Again, if you have any questions on 
this, head over to u...@tika.apache.org
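For option 1, the traversal is roughly this (a sketch using java-libpst's com.pff classes; not production code):

import java.util.Vector;
import com.pff.PSTFile;
import com.pff.PSTFolder;
import com.pff.PSTMessage;
import com.pff.PSTObject;

public class PstWalker {
    public static void main(String[] args) throws Exception {
        PSTFile pst = new PSTFile("/home/ec2-user/sateamc_0006.pst");
        walk(pst.getRootFolder());
    }

    static void walk(PSTFolder folder) throws Exception {
        if (folder.hasSubfolders()) {
            Vector<PSTFolder> subs = folder.getSubFolders();
            for (PSTFolder sub : subs) {
                walk(sub);
            }
        }
        PSTObject child = folder.getNextChild();
        while (child != null) {
            if (child instanceof PSTMessage) {
                PSTMessage msg = (PSTMessage) child;
                // build an .eml (or a Solr doc) from these; getClientSubmitTime() gives you the date
                System.out.println(msg.getSubject() + " | " + msg.getClientSubmitTime());
            }
            child = folder.getNextChild();
        }
    }
}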

I think what you want is something along the lines of SOLR-7229, which would 
treat each embedded document as its own document.  That issue is not resolved, 
and there's currently no way of doing this within DIH that I'm aware of.

If others on this list have an interest in SOLR-7229, let me know, and I'll try 
to find some time.  I'd need feedback on some design decisions.





-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Thursday, February 11, 2016 1:43 PM
To: solr-user@lucene.apache.org
Subject: outlook email file pst extraction problem

Hi ,
   I am currently indexing individual outlook messages and searching is 
working fine.
I have created solr core using following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using following command to index individual messages.
curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"
-F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But new requirement is extract messages using outlook pst file.
I tried following command to extract messages from outlook pst file.

curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"
-F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only high-level tags and extracts all messages into 
one message. I am not getting all the tags I get when extracting individual messages. 
Is the above command correct? Is the problem that it is not using recursion?
 How do I add recursion to the above command? Is it a Tika library problem?

Please help to solve above problem.

Advanced Thanks.

--sreenivasa kallu


RE: outlook email file pst extraction problem

2016-02-11 Thread Allison, Timothy B.
Y, this looks like a Tika feature.  If you run the tika-app.jar [1]on your file 
and you get the same output, then that's Tika's doing.

Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs.

-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com] 
Sent: Thursday, February 11, 2016 1:43 PM
To: solr-user@lucene.apache.org
Subject: outlook email file pst extraction problem

Hi ,
   I am currently indexing individual outlook messages and searching is 
working fine.
I have created solr core using following command.
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using following command to index individual messages.
curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"
-F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But new requirement is extract messages using outlook pst file.
I tried following command to extract messages from outlook pst file.

curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"
-F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only high-level tags and extracts all messages into 
one message. I am not getting all the tags I get when extracting individual messages. 
Is the above command correct? Is the problem that it is not using recursion?
 How do I add recursion to the above command? Is it a Tika library problem?

Please help to solve above problem.

Advanced Thanks.

--sreenivasa kallu


RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
x-post to Tika user's

Y and n.  If you run tika app as: 

java -jar tika-app.jar <input_directory> <output_directory>

It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This 
creates a parent and child process, if the child process notices a hung thread, 
it dies, and the parent restarts it.  Or if your OS gets upset with the child 
process and kills it out of self preservation, the parent restarts the child, 
or if there's an OOM...and you can configure how often the child shuts itself 
down (with parental restarting) to mitigate memory leaks.

So, y, if your use case allows  , then we now have that 
in Tika.

I've been wanting to add a similar watchdog to tika-server ... any interest in 
that?


-Original Message-
From: xavi jmlucjav [mailto:jmluc...@gmail.com] 
Sent: Thursday, February 11, 2016 2:16 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How is Tika used with Solr

I have found that when you deal with large amounts of all sorts of files, in the 
end you find stuff (pdfs are typically nasty) that will hang tika. That is even 
worse than a crash or OOM.
We used aperture instead of tika because at the time it provided a watchdog 
feature to kill what seemed like a hanged extracting thread. That feature is 
super important for a robust text extracting pipeline. Has Tika gained such 
feature already?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if 
> you use the simple "run Tika in a SolrJ program" approach you _must_ 
> abort the program on OOM errors and the like and  figure out what's 
> going on with the offending document(s). Or record the name somewhere 
> and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing 
> with errors one at a time.
> For robust systems where you have to have indexing available at all 
> times and _especially_ where you don't control the document corpus, 
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the 
> > time
> (regular catchable exceptions), that'll work.  And, by vast majority, 
> aside from oom on very large files, we aren't seeing these problems 
> any more in our 3 million doc corpus (y, I know, small by today's 
> standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these 
> types of problems, that users aren't complaining too often about 
> catastrophic failures of Tika within Solr Cell, and that this thread 
> is not yet swamped with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and 
> because Tika is a kitchen sink and we can't prevent memory leaks in 
> our dependencies, one needs to be aware that bad things can 
> happen...if only very, very rarely.  For a fellow traveler who has run 
> into these issues on massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On 
> conventional single box setups, the ForkParser within Tika is one 
> option, tika-batch is another.  Hand rolling your own parent/child 
> process is non-trivial and is not necessary for the vast majority of use 
> cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any 
> real benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> > <talli...@mitre.org>
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
Y, and you can't actually kill a thread.  You can ask nicely via 
Thread.interrupt(), but some of our dependencies don't bother to listen  for 
that.  So, you're pretty much left with a separate process as the only robust 
solution.
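To make that concrete, about the best you can do in-process is something like this (a sketch using java.util.concurrent; the sleeping Callable just stands in for a parse call):

import java.util.concurrent.*;

public class ParseWithTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> result = executor.submit(() -> {
            Thread.sleep(120_000);   // stand-in for a parser stuck on a nasty file
            return "extracted text";
        });
        try {
            System.out.println(result.get(10, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // cancel(true) only *requests* an interrupt; a parser that never checks
            // the interrupt flag just keeps running until the jvm dies
            result.cancel(true);
        } finally {
            executor.shutdownNow();
        }
    }
}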

So, we did the parent-child process thing for directory-> directory processing 
in tika-app via tika-batch.

The next step is to harden tika-server and to kick that off in a child process 
in a similar way.

For those who want to test their Tika harnesses (whether on single box, 
Hadoop/Spark etc), we added a MockParser that will do whatever you tell it when 
it hits an "application/xml+mock" file...full set of options:




<mock>
  <metadata action="add" name="author">Nikolai Lobachevsky</metadata>
  <write element="p">some content</write>
  <print_out>writing to System.out</print_out>
  <print_err>writing to System.err</print_err>
  <hang millis="10000" heavy="false" interruptible="false"/>
  <oom/>
  <throw class="java.io.IOException">not another IOException</throw>
</mock>

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, February 11, 2016 7:46 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How is Tika used with Solr

Well, I'd imagine you could spawn threads and monitor/kill them as necessary, 
although that doesn't deal with OOM errors

FWIW,
Erick

On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> For sure, if I need heavy duty text extraction again, Tika would be 
> the obvious choice if it covers dealing with hangs. I never used 
> tika-server myself (not sure if it existed at the time) just used tika from 
> my own jvm.
>
> On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
>> x-post to Tika user's
>>
>> Y and n.  If you run tika app as:
>>
>> java -jar tika-app.jar <input_directory> <output_directory>
>>
>> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  
>> This creates a parent and child process, if the child process notices 
>> a hung thread, it dies, and the parent restarts it.  Or if your OS 
>> gets upset with the child process and kills it out of self 
>> preservation, the parent restarts the child, or if there's an 
>> OOM...and you can configure how often the child shuts itself down 
>> (with parental restarting) to mitigate memory leaks.
>>
>> So, y, if your use case allows  , then we now 
>> have that in Tika.
>>
>> I've been wanting to add a similar watchdog to tika-server ... any 
>> interest in that?
>>
>>
>> -Original Message-
>> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
>> Sent: Thursday, February 11, 2016 2:16 PM
>> To: solr-user <solr-user@lucene.apache.org>
>> Subject: Re: How is Tika used with Solr
>>
>> I have found that when you deal with large amounts of all sorts of 
>> files, in the end you find stuff (pdfs are typically nasty) that will hang 
>> tika.
>> That is even worse than a crash or OOM.
>> We used aperture instead of tika because at the time it provided a 
>> watchdog feature to kill what seemed like a hanged extracting thread. 
>> That feature is super important for a robust text extracting 
>> pipeline. Has Tika gained such feature already?
>>
>> xavier
>>
>> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
>> <erickerick...@gmail.com>
>> wrote:
>>
>> > Timothy's points are absolutely spot-on. In production scenarios, 
>> > if you use the simple "run Tika in a SolrJ program" approach you 
>> > _must_ abort the program on OOM errors and the like and  figure out 
>> > what's going on with the offending document(s). Or record the name 
>> > somewhere and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
>> > sets of documents or one-time indexing, you can get by with dealing 
>> > with errors one at a time.
>> > For robust systems where you have to have indexing available at all 
>> > times and _especially_ where you don't control the document corpus, 
>> > you have to build something far more tolerant as per Tim's comments.
>> >
>> > FWIW,
>> > Erick
>> >
>> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>> > <talli...@mitre.org>
>> > wrote:
>> > > I completely agree on the impulse, and for the vast majority of 
>> > > the time
>> > (regular catchable exceptions), that'll work.  And, by vast 
>> > majority, aside from oom on very large files, we aren't seeing 
>> > these problems any more in our 3 million doc corpus (y, I know, 
>> > small by today's
>> > standards) from
>> > govdocs1 and

RE: How is Tika used with Solr

2016-02-10 Thread Allison, Timothy B.
I completely agree on the impulse, and for the vast majority of the time 
(regular catchable exceptions), that'll work.  And, by vast majority, aside 
from oom on very large files, we aren't seeing these problems any more in our 3 
million doc corpus (y, I know, small by today's standards) from govdocs1 and 
Common Crawl over on our Rackspace vm. 

Given my focus on Tika, I'm overly sensitive to the worst case scenarios.  I 
find it encouraging, Erick, that you haven't seen these types of problems, that 
users aren't complaining too often about catastrophic failures of Tika within 
Solr Cell, and that this thread is not yet swamped with integrators agreeing 
with me. :)

However, because oom can leave memory in a corrupted state (right?), because 
you can't actually kill a thread for a permanent hang and because Tika is a 
kitchen sink and we can't prevent memory leaks in our dependencies, one needs 
to be aware that bad things can happen...if only very, very rarely.  For a 
fellow traveler who has run into these issues on massive data sets, see also 
[0].

Configuring Hadoop to work around these types of problems is not too difficult 
-- it has to be done with some thought, though.  On conventional single box 
setups, the ForkParser within Tika is one option, tika-batch is another.  Hand 
rolling your own parent/child process is non-trivial and is not necessary for 
the vast majority of use cases.


[0] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
 



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, February 09, 2016 10:05 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How is Tika used with Solr

My impulse would be to _not_ run Tika in its own JVM, just catch any exceptions 
in my code and "do the right thing". I'm not sure I see any real benefit in yet 
another JVM.

FWIW,
Erick

On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
> I have one answer here [0], but I'd be interested to hear what Solr 
> users/devs/integrators have experienced on this topic.
>
> [0] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1P
> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlo
> ok.com%3E
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Tuesday, February 09, 2016 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How is Tika used with Solr
>
> Thank you Erick and Alex.
>
> My main question is with a long running process using Tika in the same JVM as 
> my application.  I'm running my file-system-crawler in its own JVM (not 
> Solr's).  On the Tika mailing list, it is suggested to run Tika's code in its 
> own JVM and invoke it from my file-system-crawler using 
> Runtime.getRuntime().exec().
>
> I fully understand from Alex's suggestion and the link provided by Erick to use 
> Tika outside Solr.  But what about using Tika within the same JVM as my 
> file-system-crawler application, or should I be making a system call to invoke 
> another JAR that runs in its own JVM to extract the raw text?  Are there 
> known issues with Tika when used in a long running process?
>
> Steve
>
>


RE: How is Tika used with Solr

2016-02-10 Thread Allison, Timothy B.
Ha.  Spoke too soon about this thread not getting swamped.

Will add the dropwizard-tika-server to our wiki page.  Thank you for the link!

As a side note, I'll submit a pull request to update the AbstractTikaResource 
to avoid a potential NPE if the mime type can't be parsed...we just fixed this 
over in our tika-server.

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Wednesday, February 10, 2016 3:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How is Tika used with Solr

On 09/02/2016 22:49, Alexandre Rafalovitch wrote:
> Solr uses Tika directly. And not in the most efficient way. It is 
> there mostly for convenience rather than performance.
>
> So, for performance, Solr recommendation is also to run Tika 
> separately and only send Solr the processed documents.

Absolutely. It's entirely possible to kill Tika with a bad PDF or something, 
bringing down your Solr instance.

Here's something a colleague wrote to wrap Tika in a server, maybe you can use 
it:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie
>
> Regards,
>  Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White  wrote:
>> Hi folks,
>>
>> I'm writing a file-system-crawler that will index files.  The file 
>> system is going to be very busy an I anticipate on average 10 new 
>> updates per min.  My application checks for new or updated files once 
>> every 1 min.  I use Tika to extract the raw-text off those files and 
>> send them over to Solr for indexing.  My application will be running 
>> 24x7xN-days.  It will not recycle unless if the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild 
>> in a production environment, I encourage separating Tika into a 
>> separate jvm vs tying it into any post processing – consider 
>> tika-batch and writing separate text files for each file processed 
>> (not so efficient, but exceedingly robust).  If this is demo code or 
>> you know your document set well enough, you should be good to go with 
>> keeping Tika and your postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika?  Does it run Tika in its 
>> own JVM as an out-of-process application or does it link with Tika 
>> JARs directly?  If it links in directly, are there known issues with 
>> Solr integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: How is Tika used with Solr

2016-02-09 Thread Allison, Timothy B.
I have one answer here [0], but I'd be interested to hear what Solr 
users/devs/integrators have experienced on this topic.

[0] 
http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
 

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Tuesday, February 09, 2016 6:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How is Tika used with Solr

Thank you Erick and Alex.

My main question is with a long running process using Tika in the same JVM as 
my application.  I'm running my file-system-crawler in its own JVM (not 
Solr's).  On the Tika mailing list, it is suggested to run Tika's code in its own 
JVM and invoke it from my file-system-crawler using Runtime.getRuntime().exec().

I fully understand from Alex's suggestion and the link provided by Erick to use Tika 
outside Solr.  But what about using Tika within the same JVM as my 
file-system-crawler application, or should I be making a system call to invoke 
another JAR that runs in its own JVM to extract the raw text?  Are there known 
issues with Tika when used in a long running process?

Steve




RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
Right.  Thank you for reporting the solution.  

Be aware, though, that some parser dependencies are not included with the Solr 
distribution, and, because of the way that Tika currently works, you'll 
silently get no text/metadata from those file types (e.g. sqlite files and 
others).  See [1] for some discussion of this.  If you want the full Tika (with 
all of its messiness) and you are already using SolrJ, use the tika-app.jar.

Your code will correctly extract content from embedded documents, but it will 
not extract metadata from embedded documents/attachments (SOLR-7229).  If you 
want to be able to process metadata from embedded docs, you might consider the 
RecursiveParserWrapper.
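
A rough sketch of the RecursiveParserWrapper route under the 1.x API (the file path is made up, and later Tika versions may shuffle this API around).  You get one Metadata object per document -- the container first, then each embedded doc/attachment -- with the extracted text stashed in the metadata:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RecursiveWrapperSketch {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(
                Paths.get("/path/to/email_with_attachments.msg"))) {
            wrapper.parse(in, new DefaultHandler(), metadata, new ParseContext());
        }
        // container doc first, then one Metadata per embedded doc/attachment
        List<Metadata> metadataList = wrapper.getMetadata();
        for (Metadata m : metadataList) {
            System.out.println(m.get(Metadata.RESOURCE_NAME_KEY) + " -> "
                    + m.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
    }
}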

Note, too, that if you send in a ParseContext (SOLR-7189) in your call to 
parse, make sure to add the AutoDetectParser or else you will get no content 
from embedded docs.

Both of these will get embedded content:

parser.parse(in, contentHandler, metadata);

Or

ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(in, contentHandler, metadata, context);

This will not:
ParseContext context = new ParseContext();
parser.parse(in, contentHandler, metadata, context);


As you've already done, feel free to ask more Tika-specific questions over on 
tika-user.

Cheers,

   Tim

[1] 
https://issues.apache.org/jira/browse/TIKA-1511?focusedCommentId=14385803=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14385803

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Tuesday, February 02, 2016 7:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Tika that comes with Solr 5.2

I found my issue.  I need to include JARs off: \solr\contrib\extraction\lib\

Steve

On Tue, Feb 2, 2016 at 4:24 PM, Steven White <swhite4...@gmail.com> wrote:

> I'm not using solr-app.jar.  I need to stick with Tika JARs that come 
> with Solr 5.2 and yet get the full text extraction feature of Tika 
> (all file types it supports).
>
> At first, I started to include Tika JARs as needed; I now have all 
> Tika related JARs that come with Solr and yet it is not working.  Here 
> is the
> list: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, 
> tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, 
> kite-morphlines-tika-core-0.12.1.jar
> and kite-morphlines-tika-decompress-0.12.1.jar.  As part of my 
> program, I also have SolrJ JARs and their dependency: 
> solr-solrj-5.2.1.jar, solr-core-5.2.1.jar, etc.
>
> You said "Might not have the parsers on your path within your Solr 
> framework?".  I"m using Tika outside Solr framework.  I'm trying to 
> use Tika from my own crawler application that uses SojrJ to send the 
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
>> Might not have the parsers on your path within your Solr framework?
>>
>> Which tika jars are on your path?
>>
>> If you want the functionality of all of Tika, use the standalone 
>> tika-app.jar, but do not use the app in the same JVM as 
>> Solr...without a custom class loader.  The Solr team carefully prunes 
>> the dependencies when integrating Tika and makes sure that the main parsers 
>> _just work_.
>>
>>
>> -Original Message-
>> From: Steven White [mailto:swhite4...@gmail.com]
>> Sent: Tuesday, February 02, 2016 2:53 PM
>> To: solr-user@lucene.apache.org
>> Subject: Using Tika that comes with Solr 5.2
>>
>> Hi,
>>
>> I'm trying to use Tika that comes with Solr 5.2.  The following code 
>> is not
>> working:
>>
>> public static void parseWithTika() throws Exception {
>> File file = new File("C:\\temp\\test.pdf");
>>
>> FileInputStream in = new FileInputStream(file);
>> AutoDetectParser parser = new AutoDetectParser();
>> Metadata metadata = new Metadata();
>> metadata.add(Metadata.RESOURCE_NAME_KEY, file.getName());
>> BodyContentHandler contentHandler = new BodyContentHandler();
>>
>> parser.parse(in, contentHandler, metadata);
>>
>> String content = contentHandler.toString();   <=== 'content' is always
>> empty
>>
>> in.close();
>> }
>>
>> 'content' is always empty string unless when the file I pass to Tika 
>> is a text file.  Any idea what's the issue?
>>
>> I have also tried sample codes off
>> https://tika.apache.org/1.8/examples.html
>> with the same result.
>>
>>
>> Thanks !!
>>
>> Steve
>>
>
>


RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
>Be aware, though, that some parser dependencies are not included with the Solr 
>distribution, and, because of the way that Tika currently works, you'll 
>silently >get no text/metadata from those file types (e.g. sqlite files and 
>others).  See [1] for some discussion of this.  If you want the full Tika 
>(with all of its messiness) >and you are already using SolrJ, use the 
>tika-app.jar.

Correction: I just realized that's only mostly true.  We aren't packaging the sqlite 
jar in tika-app any more (for the same reason that Solr doesn't -- native 
libs), so you'll have to grab it yourself and add it to your classpath. :)

See also, very recently: 
https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E
 

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, February 03, 2016 7:35 AM
To: solr-user@lucene.apache.org
Subject: RE: Using Tika that comes with Solr 5.2

Right.  Thank you for reporting the solution.  

Be aware, though, that some parser dependencies are not included with the Solr 
distribution, and, because of the way that Tika currently works, you'll 
silently get no text/metadata from those file types (e.g. sqlite files and 
others).  See [1] for some discussion of this.  If you want the full Tika (with 
all of its messiness) and you are already using SolrJ, use the tika-app.jar.

Your code will correctly extract content from embedded documents, but it will 
not extract metadata from embedded documents/attachments (SOLR-7229).  If you 
want to be able to process metadata from embedded docs, you might consider the 
RecursiveParserWrapper.

Note, too, that if you send in a ParseContext (SOLR-7189) in your call to 
parse, make sure to add the AutoDetectParser or else you will get no content 
from embedded docs.

Both of these will get embedded content:

parser.parse(in, contentHandler, metadata);

Or

ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(in, contentHandler, metadata, context);

This will not:
ParseContext context = new ParseContext();
parser.parse(in, contentHandler, metadata, context);


As you've already done, feel free to ask more Tika-specific questions over on 
tika-user.

Cheers,

   Tim

[1] 
https://issues.apache.org/jira/browse/TIKA-1511?focusedCommentId=14385803=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14385803

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Tika that comes with Solr 5.2

I found my issue.  I need to include JARs off: \solr\contrib\extraction\lib\

Steve

On Tue, Feb 2, 2016 at 4:24 PM, Steven White <swhite4...@gmail.com> wrote:

> I'm not using solr-app.jar.  I need to stick with Tika JARs that come 
> with Solr 5.2 and yet get the full text extraction feature of Tika 
> (all file types it supports).
>
> At first, I started to include Tika JARs as needed; I now have all 
> Tika related JARs that come with Solr and yet it is not working.  Here 
> is the
> list: tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, 
> tika-xmp-1.7.jar, vorbis-java-tika-0.6.jar, 
> kite-morphlines-tika-core-0.12.1.jar
> and kite-morphlines-tika-decompress-0.12.1.jar.  As part of my 
> program, I also have SolrJ JARs and their dependency:
> solr-solrj-5.2.1.jar, solr-core-5.2.1.jar, etc.
>
> You said "Might not have the parsers on your path within your Solr 
> framework?".  I"m using Tika outside Solr framework.  I'm trying to 
> use Tika from my own crawler application that uses SojrJ to send the 
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
>> Might not have the parsers on your path within your Solr framework?
>>
>> Which tika jars are on your path?
>>
>> If you want the functionality of all of Tika, use the standalone 
>> tika-app.jar, but do not use the app in the same JVM as 
>> Solr...without a custom class loader.  The Solr team carefully prunes 
>> the dependencies when integrating Tika and makes sure that the main parsers 
>> _just work_.
>>
>>
>> -Original Message-
>> From: Steven White [mailto:swhite4...@gmail.com]
>> Sent: Tuesday, February 02, 2016 2:53 PM
>> To: solr-user@lucene.apache.org
>> Subject: Using Tika that comes with Solr 5.2
>>
>> Hi,
>>
>> I'm trying to use Tika that comes with Solr 5.2.  The following code 
>> is not
>> working:
>>
>> public static void parseWithTika() throws Exception {
>> File file = new File("C:\\

RE: Multi-lingual search

2016-02-02 Thread Allison, Timothy B.
Three basic options: 
1) one generic field that handles non-whitespace languages and normalization 
robustly (downside: no language-specific stopwords, stemming, etc.)
2) one field per language (hope lang id works and that you don't have many 
multilingual docs)
3) one Solr core per language (ditto)

For the first option (a good first start, no matter what), see:
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/98895 

> <fieldType name="text_allLangs" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <!-- illustrative only: exact tokenizer/filters per the thread linked above;
>          ICU tokenization + folding is the usual language-agnostic choice -->
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>

Second two options are well described here: 
http://www.basistech.com/multilingual-search-with-solr-no-problem/ 

See also:
http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/
 


-Original Message-
From: vidya [mailto:vidya.nade...@tcs.com] 
Sent: Monday, February 01, 2016 8:35 AM
To: solr-user@lucene.apache.org
Subject: Multi-lingual search

Hi

 My use case is to index and able to query different languages in solr which 
are not in-built languages supported by solr. How can i implement this ? 

My input document consists of different languages in a field. I came across 
"Solr in action" book with searching content in multiple languages i.e., 
chapter 14. For built in languages i have implemented this approach. But for 
languages like Tamil, how to implement? Do i need to find for filter classes of 
that particular language or any libraries in specific.

Please help me on this.

Thanks in advance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-lingual-search-tp4254398.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Using Tika that comes with Solr 5.2

2016-02-02 Thread Allison, Timothy B.
Might not have the parsers on your path within your Solr framework?  

Which tika jars are on your path?

If you want the functionality of all of Tika, use the standalone tika-app.jar, 
but do not use the app in the same JVM as Solr...without a custom class loader. 
 The Solr team carefully prunes the dependencies when integrating Tika and 
makes sure that the main parsers _just work_.
 

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Tuesday, February 02, 2016 2:53 PM
To: solr-user@lucene.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi,

I'm trying to use Tika that comes with Solr 5.2.  The following code is not
working:

public static void parseWithTika() throws Exception {
File file = new File("C:\\temp\\test.pdf");

FileInputStream in = new FileInputStream(file);
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
metadata.add(Metadata.RESOURCE_NAME_KEY, file.getName());
BodyContentHandler contentHandler = new BodyContentHandler();

parser.parse(in, contentHandler, metadata);

String content = contentHandler.toString();   <=== 'content' is always
empty

in.close();
}

'content' is always empty string unless when the file I pass to Tika is a text 
file.  Any idea what's the issue?

I have also tried sample codes off https://tika.apache.org/1.8/examples.html
with the same result.


Thanks !!

Steve


RE: When does Solr plan to update its embedded Apache Tika version?

2016-02-02 Thread Allison, Timothy B.
Don't know what the answer from the Solr side is, but from the Tika side, I 
recently failed to get TIKA-1830 into Tika 1.12...so there may be a need to 
wait for Tika 1.13.

No matter the answer on when there'll be an upgrade within Solr, I strongly 
encourage carving Tika into a separate JVM/server from Solr.  If I may beat 
Erick to the punch on this one: 
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/ .  See also the 
following for what can go wrong with Tika:

[0] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] 
http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika.pdf

From: Giovanni Usai [mailto:giovanni.u...@francelabs.com]
Sent: Tuesday, February 02, 2016 8:43 AM
To: solr-user@lucene.apache.org
Subject: When does Solr plan to update its embedded Apache Tika version?



I would gladly welcome the reply of the community on the following subject:

As of the latest version (5.4.1), Solr embeds Tika artifacts (in 
contrib/extraction/lib) at version 1.7 and dependent POI artifacts at version 3.11.
Do you know when you plan to update the version of Tika to a more recent one?

Just for your information, we are embedding Solr in our open source product 
"Datafari" and we are defining a new Parser that needs a newer version of Tika.

Thanks and

Best regards,
Giovanni Usai
giovanni.u...@francelabs.com

www.francelabs.com

CEEI Nice Premium
1 Bd. Maître Maurice Slama
06200 Nice FRANCE

Ph: +33 (0)9 72 43 72 85


RE: Many patterns against many sentences, storing all results

2016-01-05 Thread Allison, Timothy B.
Might want to look into:

https://github.com/flaxsearch/luwak

or 
 https://github.com/OpenSextant/SolrTextTagger  
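
Luwak's core idea is to store the queries and run each incoming document against them.  A brute-force sketch of that idea with Lucene's MemoryIndex looks something like the following (field name and class name are made up); luwak adds a presearcher step so that only the queries that could plausibly match are actually run:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

import java.util.Map;

public class SentenceMatcherSketch {
    private final StandardAnalyzer analyzer = new StandardAnalyzer();

    // storedQueries: pattern id -> parsed Lucene query (the ~5,000 search patterns)
    public void matchSentence(String sentenceId, String text,
                              Map<String, Query> storedQueries) {
        // one-document, in-memory index per incoming sentence
        MemoryIndex index = new MemoryIndex();
        index.addField("text", text, analyzer);
        for (Map.Entry<String, Query> e : storedQueries.entrySet()) {
            if (index.search(e.getValue()) > 0.0f) {
                // record (sentenceId, patternId) in the external DB
                System.out.println(sentenceId + " matches " + e.getKey());
            }
        }
    }
}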

-Original Message-
From: Will Moy [mailto:w...@fullfact.org] 
Sent: Tuesday, January 05, 2016 11:02 AM
To: solr-user@lucene.apache.org
Subject: Many patterns against many sentences, storing all results

Hello

Please may I have your advice as to whether Solr is a good tool for this job?

We have (per year) –
Up to 50,000,000 sentences
And about 5,000 search patterns (i.e. queries)

Our task is to identify all matches between any sentence and any search pattern.

That list of detections must be kept up to date as patterns are added or 
updated (a handful an hour), and as new sentences are added.

Some of the sentences will be added in real time, at probably max 100 / second 
and usually much less. The detections on these should be provided within 3 
seconds.

It's an unusual application in that we want all results in an external DB, and 
also in that every sentence is either a hit or not.  We don't care about scoring 
results, only about matches for the exact search pattern entered.

The application is automatically detecting instances of factchecked statements.

The smaller-scale prototype was done with postgres full text searching, but 
that can't do exact phrase matching or other more sophisticated searches, so 
it's out.

Thanks very much

Will


RE: Unable to extract images content (OCR) from PDF files using Solr

2016-01-05 Thread Allison, Timothy B.
I concur with Erick and Upayavira that it is best to keep Tika in a separate 
JVM...well, ideally a separate box or rack or even data center [0][1]. :)

But seriously, if you're using DIH/SolrCell, you have to configure Tika to 
parse documents recursively.  This was made possible in SOLR-7189...see the 
test case/patch [2] for how to configure this.  Given that this is the behavior 
that most people probably expect, we may want to modify the default setting in 
DIH; this may be a major/breaking default change, though.

As always, please ping the Tika users list if you have any questions.

Looks like we should update our wiki [3] to include guidance on OCR'ing 
embedded images.

[0] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] 
http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika.pdf
[2]https://issues.apache.org/jira/browse/SOLR-7189
[3] https://wiki.apache.org/tika/TikaOCR
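
For reference, doing the same thing with Tika directly (outside of DIH/SolrCell) looks roughly like this -- the file path is made up, and tesseract has to be installed and on the PATH for the OCR parser to kick in:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfInlineImageOcrSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true); // hand images inside the PDF to the OCR parser

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(Parser.class, parser); // needed so embedded docs/images are parsed recursively

        try (InputStream in = Files.newInputStream(Paths.get("/path/to/scanned.pdf"))) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata, context);
            System.out.println(handler.toString()); // should now include OCR'd image text
        }
    }
}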

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, December 24, 2015 2:52 PM
To: solr-user 
Subject: Re: Unable to extract images content (OCR) from PDF files using Solr

Here's an example of what Upayavira is talking about.
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

It has some RDBMS bits, but you can take those out.
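
Stripped down to just Tika + SolrJ, the shape of it is roughly this (URL, core name and field names are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TikaToSolrSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try (HttpSolrClient solr =
                     new HttpSolrClient("http://localhost:8983/solr/mycore")) {
            Path file = Paths.get("/path/to/some.docx");
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(file)) {
                // extraction happens here, in your JVM, not Solr's
                parser.parse(in, handler, metadata, new ParseContext());
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.toString());
            doc.addField("content", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}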

Best,
Erick

On Wed, Dec 23, 2015 at 1:27 AM, Upayavira  wrote:
> If your needs of Tika fall outside of those provided by the embedded 
> Tika, I would suggest you include Tika in your own ingestion pipeline, 
> and just post raw content to Solr. This will probably perform better 
> anyway, as you are otherwise using up valuable Solr resources to do 
> your extraction work, and, as you are seeing, have far less control 
> over what happens inside than you would if Tika was consumed by your 
> own application.
>
> Upayavira
>
> On Wed, Dec 23, 2015, at 03:11 AM, Zheng Lin Edwin Yeo wrote:
>> Hi,
>>
>> I'm also facing the same issue as what you faced 2 months back, like 
>> able to extract the image content if there are in .jpg or .png 
>> format, but not able to extract the images in pdf, even after setting 
>> "extractInlineImages true" in the PDFParser.properties.
>>
>> Have you managed to find alternative solutions to this problem?
>>
>> Regards,
>> Edwin
>>
>> On 22 October 2015 at 18:05, Damien Picard 
>> wrote:
>>
>> > Hi,
>> >
>> > I'm using Solr 5.3.0 on a Red Hat EL 7 and I try to extract content 
>> > from PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.
>> >
>> > Everything works fine, except when I want to extract content from 
>> > embedding images in PDF/Word etc. documents :
>> >
>> > I send an extract request like this :
>> > POST /update/extract?literal.id
>> > =ocrpdf8=attr_content=attr_
>> >
>> > In attr_content, I get :
>> > \n \n date 2015-08-28T13:23:03Z \n
>> > pdf:PDFVersion 1.4 \n
>> > xmp:CreatorTool PDFCreator Version 1.2.3 \n  stream_content_type 
>> > application/pdf \n  Keywords \n  subject \n  dc:creator S050735 \n  
>> > dcterms:created 2015-08-28T13:23:03Z \n  Last-Modified 
>> > 2015-08-28T13:23:03Z \n  dcterms:modified 2015-08-28T13:23:03Z \n  
>> > dc:format application/pdf; version=1.4 \n  Last-Save-Date 
>> > 2015-08-28T13:23:03Z \n  stream_name imagepdf.pdf \n  
>> > meta:save-date 2015-08-28T13:23:03Z \n  pdf:encrypted false \n  
>> > dc:title imagepdf \n  modified 2015-08-28T13:23:03Z \n  cp:subject 
>> > \n  Content-Type application/pdf \n  stream_size 423660 \n  
>> > X-Parsed-By org.apache.tika.parser.DefaultParser \n  X-Parsed-By 
>> > org.apache.tika.parser.pdf.PDFParser \n  creator S050735 \n  
>> > meta:author S050735 \n  dc:subject \n  meta:creation-date 
>> > 2015-08-28T13:23:03Z \n  stream_source_info the-file \n  created 
>> > Fri Aug 28 13:23:03 UTC 2015 \n  xmpTPg:NPages 1 \n  Creation-Date 
>> > 2015-08-28T13:23:03Z \n  meta:keyword \n  Author S050735 \n  
>> > producer GPL Ghostscript 9.04 \n  imagepdf \n  \n  page \n  Page 1 
>> > sur 1\n \n
>> >  28/08/2015
>> > http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
>> > ..
>> > \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg 
>> > embedded:image2.jpg image2.jpg \n
>> >
>> > So, tika works fine, but it doesn't apply OCR content extraction on 
>> > the embedded images.
>> >
>> > When I post an image (JPG) on /update/extract, I get its content 
>> > indexed throught Tesseract OCR (attr_content) field :
>> > \n \n stream_size 55422 \n
>> >  X-Parsed-By org.apache.tika.parser.DefaultParser \n  X-Parsed-By 
>> > org.apache.tika.parser.ocr.TesseractOCRParser \n  
>> > stream_content_type image/jpeg \n  stream_name OM_1.jpg \n  
>> > stream_source_info the-file \n  Content-Type image/jpeg \n \n \n  ‘ 
>> > '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was 
>> > visiting a.\ncertain public school, a school set in a typically 

RE: Permutations of entries in a multivalued field

2015-12-18 Thread Allison, Timothy B.
Hi Johannes,
  I suspect that Scott's answer would be more efficient than the following, and 
I may be misunderstanding the problem!

 This type of search is supported at the Lucene level by a SpanNearQuery with 
inOrder set to false.
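
At the raw Lucene level, that looks something like this (field and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// A, B and C must all occur near each other (slop 0), but in any order (inOrder = false)
SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("myField", "a")),
        new SpanTermQuery(new Term("myField", "b")),
        new SpanTermQuery(new Term("myField", "c"))
};
SpanNearQuery anyOrder = new SpanNearQuery(clauses, 0, false);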
  
 So, how do you get a SpanQuery in Solr?  You might want to look at the 
SurroundQueryParser, and I have an alternate (LUCENE-5205/SOLR-5410) here: 
https://github.com/tballison/lucene-addons. 

 If you do find an appropriate parser, make sure that your position increment 
gap is > 0 on your text field definition, and then you'd never incorrectly get 
a hit across field entries of:

[0] A B
[1] C

Best,
   Tim

On Wed, Dec 16, 2015 at 8:38 AM, Johannes Riedl < 
johannes.ri...@uni-tuebingen.de> wrote:

> Hello all,
>
> we are facing the following problem: we use a multivalued string field 
> that contains entries of the kind A/B/C/, where A,B,C are terms.
> We are now looking for a simple way to also find all permutations of 
> A/B/C, so e.g. B/A/C. As a workaround we added a new field that 
> contains all entries alphabetically sorted and guarantee sorting on the user 
> side.
> However - since this is limited in some ways - is there a simple way 
> to either index in a way such that solely A/B/C and all permutations 
> are found (using e.g. type=text is not an option since a term could 
> occur in a different entry of the multivalued field) or trigger an 
> alphabetical sorting of incoming queries.
>
> Thanks a lot for your feedback, best regards
>
> Johannes
>
>


--
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


RE: Permutations of entries in a multivalued field

2015-12-18 Thread Allison, Timothy B.
Duh, didn't realize you could set inOrder in Solr.  Y, that's the better 
solution.  

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, December 18, 2015 2:27 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Permutations of entries in a multivalued field

The other thing to check is the ComplexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

It uses the Span queries to build up the query...
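
If memory serves, you can drop the order requirement with the inOrder local param, roughly (field name made up):

q={!complexphrase inOrder=false}myField:"A B C"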

Best,
Erick

On Fri, Dec 18, 2015 at 11:23 AM, Allison, Timothy B.
<talli...@mitre.org> wrote:
> Hi Johannes,
>   I suspect that Scott's answer would be more efficient than the following, 
> and I may be misunderstanding the problem!
>
>  This type of search is supported at the Lucene level by a SpanNearQuery with 
> inOrder set to false.
>
>  So, how do you get a SpanQuery in Solr?  You might want to look at the 
> SurroundQueryParser, and I have an alternate (LUCENE-5205/SOLR-5410) here: 
> https://github.com/tballison/lucene-addons.
>
>  If you do find an appropriate parser, make sure that your position increment 
> gap is > 0 on your text field definition, and then you'd never incorrectly 
> get a hit across field entries of:
>
> [0] A B
> [1] C
>
> Best,
>Tim
>
> On Wed, Dec 16, 2015 at 8:38 AM, Johannes Riedl < 
> johannes.ri...@uni-tuebingen.de> wrote:
>
>> Hello all,
>>
>> we are facing the following problem: we use a multivalued string 
>> field that contains entries of the kind A/B/C/, where A,B,C are terms.
>> We are now looking for a simple way to also find all permutations of 
>> A/B/C, so e.g. B/A/C. As a workaround we added a new field that 
>> contains all entries alphabetically sorted and guarantee sorting on the user 
>> side.
>> However - since this is limited in some ways - is there a simple way 
>> to either index in a way such that solely A/B/C and all permutations 
>> are found (using e.g. type=text is not an option since a term could 
>> occur in a different entry of the multivalued field) or trigger an 
>> alphabetical sorting of incoming queries.
>>
>> Thanks a lot for your feedback, best regards
>>
>> Johannes
>>
>>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, 
> LLC
> | 434.409.2780
> http://www.opensourceconnections.com


RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.
Generally, I'd recommend opening an issue on PDFBox's Jira with the file that 
you shared.  Tika uses PDFBox...if a fix can be made there, it will propagate 
back through Tika to Solr.

That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode 
mapping for CID+71 (71) in font 505Eddc6Arial

So, if the file has no Unicode mapping for the font, I doubt they'll be able to 
fix it.

pdftotext is also unable to extract anything useful from the file.

Sorry.

Best,

Tim
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, December 17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific 
> tools and change the encoding of the file?
> Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
in a way that Tika cannot easily extract the text, there's nothing you can do 
in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a character on a 
PDF page, but exactly how that character is generated (using a specific 
encoding, font, or even by drawing a picture) is outside your control. There 
are various businesses built on this premise
- they charge for creating clean extracted text from PDFs - and even they have 
trouble with some PDFs.

HTH

Charlie

>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> 
> wrote:
>
>> They could be using custom fonts and non-Unicode characters. That's 
>> probably something to explore with PDF specific tools.
>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>> wrote:
>>
>>> I've checked all the files which has problem with the content in the 
>>> Solr index using the Tika app. All of them shows the same issues as 
>>> what I see in the Solr index.
>>>
>>> So does the issue lie with the encoding of the file? Are we able 
>>> to
>> check
>>> the encoding of the file?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
>>> 
>>> wrote:
>>>
 Hi Erik,

 I've shared the file on dropbox, which you can access via the link
>> here:

>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?d
>> l=0

 This is what I get from the Tika app after dropping the file in.

 Content-Length: 75092
 Content-Type: application/pdf
 Type: COSName{Info}
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
 X-TIKA:digest:SHA256:
 d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
 access_permission:assemble_document: true
 access_permission:can_modify: true
 access_permission:can_print: true
 access_permission:can_print_degraded: true
 access_permission:extract_content: true
 access_permission:extract_for_accessibility: true
 access_permission:fill_in_form: true
 access_permission:modify_annotations: true
 dc:format: application/pdf; version=1.3
 pdf:PDFVersion: 1.3
 pdf:encrypted: false
 producer: null
 resourceName: Desmophen+670+BAe.pdf
 xmpTPg:NPages: 3


 Regards,
 Edwin


 On 17 December 2015 at 00:15, Erik Hatcher 
>>> wrote:

> Edwin - Can you share one of those PDF files?
>
> Also, drop the file into the Tika app and see what it sees 
> directly -
>>> get
> the tika-app JAR and run that desktop application.
>
> Could be an encoding issue?
>
>  Erik
>
> —
> Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com 
> 
>
>
>
>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
>>> edwinye...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm using Solr 5.3.0
>>
>> I'm indexing some PDF documents. However, for certain PDF files,
>> there
> are
>> chinese text in the documents, but after indexing, what is 
>> indexed
>> in
> the
>> content is either a series of "??" or an empty content.
>>
>> I'm using the post.jar that comes together with Solr.
>>
>> What could be the reason that causes this?
>>
>> Regards,
>> Edwin
>
>

>>>
>>
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: tikaparser docx file fails with exception

2015-11-06 Thread Allison, Timothy B.
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or 
POI's bugzilla...especially if you can share the triggering document.

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user 
Subject: Re: tikaparser docx file fails with exception

It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are 
not supported: byte 0xb7 implies a length of more than 4 bytes
  at 
org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
  at 
org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
  at 
org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
  at 
org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3477)

If you search for something like: PiccoloLexer.yy_refill Characters larger than 
4 bytes are not supported:
you get lots of various matches in different forums for different (java-based? 
tika-based?) software. Most likely Tika found something obscure in the document 
that there is no implementations for yet. E.g.
an image inside a text field inside a footer section. Just as an example

I would basically try standalone Tika and look for the most expressive debug 
flag. It should tell you which file inside the zip that docx actually is caused 
the problem. That should give you some hint.
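For example, something like

java -jar tika-app.jar -t problem.docx
java -jar tika-app.jar -m problem.docx

(file name made up) at least tells you whether the failure is in Tika/POI itself or in the Solr integration.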

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 5 November 2015 at 17:36, Aswath Srinivasan (TMS) 
 wrote:
> Thank you for attempting to answer. I will try out with solrj and standalone 
> java with tika parser. I completely understand that a bad document could 
> cause this, however, when I opened up the document I couldn't find anything 
> suspicious expect for some binary images/pictures embedded into the document.
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, November 04, 2015 4:33 PM
> To: solr-user 
> Subject: Re: tikaparser docx file fails with exception
>
> Possibly a corrupt file? Tika does its best, but bad data is...bad data.
>
> You can experiment a bit with using Tika in Java, that might give you a 
> better idea of what's really going on, here's a SolrJ example:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Wed, Nov 4, 2015 at 3:49 PM, Aswath Srinivasan (TMS) 
>  wrote:
>>
>> Trying to index a document. A docx file. Ending up with the below exception. 
>> Not sure why it is erroring out. When I opened the docx I was able to see 
>> lots of binary data like embedded pictures etc., Is there a possible 
>> solution to this or is it a bug? Only one such file fails. Rest of the files 
>> are smoothly indexed.
>>
>> 2015-11-04 23:16:11.549 INFO  (coreLoadExecutor-6-thread-1) [   x:tika] 
>> o.a.s.c.CoreContainer registering core: tika
>> 2015-11-04 23:16:11.549 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> QuerySenderListener sending requests to Searcher@1eb69b2[tika] 
>> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:11.585 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.c.S.Request [tika] webapp=null path=null 
>> params={q=static+firstSearcher+warming+in+solrconfig.xml=false=firstSearcher}
>>  hits=0 status=0 QTime=34
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> QuerySenderListener done.
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: wordbreak
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.h.c.SuggestComponent buildOnStartup: mySuggester
>> 2015-11-04 23:16:11.586 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] 
>> o.a.s.s.s.SolrSuggester SolrSuggester.build(mySuggester)
>> 2015-11-04 23:16:11.605 INFO  
>> (searcherExecutor-7-thread-1-processing-x:tika) [   x:tika] o.a.s.c.SolrCore 
>> [tika] Registered new searcher Searcher@1eb69b2[tika] 
>> main{ExitableDirectoryReader(UninvertingDirectoryReader())}
>> 2015-11-04 23:16:25.923 INFO  (qtp7980742-16) [   x:tika] 
>> o.a.s.h.d.DataImporter Loading DIH Configuration: tika-data-config.xml
>> 2015-11-04 23:16:25.937 INFO  (qtp7980742-16) [   x:tika] 
>> o.a.s.h.d.DataImporter Data Configuration 
