subject:"Indexing PDF"

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N

Thanks Erick... On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson wrote: > https://lucidworks.com/post/indexing-with-solrj/ > > > > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > > > Thanks Jorn and Erick. > > > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > > > Thanks

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson

https://lucidworks.com/post/indexing-with-solrj/ > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > Thanks Jorn and Erick. > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > Thanks > Fiz > > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson > wrote: > >> Here’s a

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N

Thanks Jorn and Erick. Hi Erick, looks like the skeletal SOLRJ program attachment is missing. Thanks Fiz On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson wrote: > Here’s a skeletal SolrJ program using Tika as another alternative. > > Best, > Erick > > > On Jun 7, 2020, at 2:06 PM, Jörn Franke

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson

Here’s a skeletal SolrJ program using Tika as another alternative. Best, Erick > On Jun 7, 2020, at 2:06 PM, Jörn Franke wrote: > > You have to write an external application that creates multiple threads, > parses the PDFs and index them in Solr. Ideally you parse the PDFs once and > store

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke

You have to write an external application that creates multiple threads, parses the PDFs and indexes them in Solr. Ideally you parse the PDFs once, store the resulting text on some file system and then index it. The reason is that if you upgrade across two major versions of Solr you might need to
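The approach described in this message (parse on several threads, keep the extracted text on disk, index from the stored text) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the poster's code: `extract_text` is a hypothetical callable standing in for whatever parser is used (Tika, for example), the `content` field name is an assumption about the schema, and the payload is only built here, not sent.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def cache_path_for(pdf_path: Path, cache_dir: Path) -> Path:
    """Map a PDF to a stable cache file name derived from its absolute path."""
    digest = hashlib.sha1(str(pdf_path.resolve()).encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.txt"


def extract_once(pdf_path: Path, cache_dir: Path, extract_text) -> Path:
    """Parse a PDF only if its text is not cached yet; return the cache file."""
    target = cache_path_for(pdf_path, cache_dir)
    if not target.exists():
        # extract_text is a hypothetical parser callback supplied by the caller.
        target.write_text(extract_text(pdf_path), encoding="utf-8")
    return target


def extract_all(pdf_paths, cache_dir: Path, extract_text, workers: int = 4):
    """Run extraction on several threads; results keep the input order."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda p: extract_once(p, cache_dir, extract_text), pdf_paths)
        )


def build_update_payload(pdf_paths, cache_dir: Path) -> str:
    """Build a JSON array of documents for a POST to Solr's /update handler."""
    docs = [
        {
            "id": str(p.resolve()),  # absolute path used as the unique key
            "content": cache_path_for(p, cache_dir).read_text(encoding="utf-8"),
        }
        for p in pdf_paths
    ]
    return json.dumps(docs)
```

Because the text is cached, a later full reindex only needs `build_update_payload` again and does not have to re-parse the PDFs.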

Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N

Hello SOLR Experts, I am working on a POC to index millions of PDF documents stored in multiple folders on a file share. Could you please let me know the best practices and steps to implement it? Thanks Fiz Nadiyal.

Re: Problem with SolrJ and indexing PDF files

2019-05-19 Thread Erick Erickson

Here’s a skeletal program to get you started using Tika directly in a SolrJ client, with a long explication of why using Solr’s extracting request handler is probably not what you want to do in production: https://lucidworks.com/2012/02/14/indexing-with-solrj/ SolrServer was renamed

Re: Problem with SolrJ and indexing PDF files

2019-05-19 Thread Jörn Franke

You can use the Tika library to parse the PDFs and then post the text to the Solr servers > Am 19.05.2019 um 11:02 schrieb Mareike Glock > : > > Dear Solr Team, > > I am trying to index Word and PDF documents with Solr using SolrJ, but most > of the examples I found on the internet use the

Problem with SolrJ and indexing PDF files

2019-05-19 Thread Mareike Glock

Dear Solr Team, I am trying to index Word and PDF documents with Solr using SolrJ, but most of the examples I found on the internet use the SolrServer class which I guess is deprecated. The connection to Solr itself is working, because I can add SolrInputDocuments to the index but it does not

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga

Yes, I know the reasons to put this work on a client rather than use Solr directly, and that should maybe be my next task. But first I need to finish my current task - indexing PDF files stored in a SqlBase database. The PDF files are pretty simple, sometimes only dozens of text lines. Regards, Aruna On

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Erick Erickson

For a lot of reasons, I greatly prefer to put this work on a client rather than use Solr directly. Here’s a place to get started, it connects to a DB and also scans local file directory for docs to push through (local) Tika and index. So you should be able to modify it relatively easily to get

Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga

Hello, I got a task to index in Solr 7.71 PDF files which are stored in a SqlBase database. I did half the job - I can index all table fields and I can search in these fields, except the field in which the PDF file content is stored. As I am totally new to Solr, I have unsuccessfully spent a lot of time

RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden

mmit(solr, "prindex"); return true; -Original Message- From: Erick Erickson Sent: Wednesday, 31 October 2018 06:00 To: solr-user Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA All of the above work, but for robust production situations you'll wan

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread ☼ R Nair

to copy & paste it > to > > > the "Documents" tab in core solr. > > > The question is : > > > 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only > > > with CLI mode ? if yes only with CLI mode, can you explain it to me >

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Erick Erickson

ally. And i just have to copy & paste it to > > the "Documents" tab in core solr. > > The question is : > > 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only > > with CLI mode ? if yes only with CLI mode, can you explain it to me please >

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Kamuela Lau

mode ? or is it only > with CLI mode ? if yes only with CLI mode, can you explain it to me please > ? > 2. Is it possible to add a text result in "Query" tab ?. > > The Background i asking about this is, i want to indexing PDF in my local > system, then i just upload

Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread adiyaksa kevin

F File to SOLR via TIKA with GUI mode ? or is it only with CLI mode ? if yes only with CLI mode, can you explain it to me please ? 2. Is it possible to add a text result in "Query" tab ?. The Background i asking about this is, i want to indexing PDF in my local system, then i just upl

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.

>http - however, the big advantage of doing your indexing on different machine >is that the heavy lifting that tika does in extracting text from documents, >finding metadata etc is not happening on the server. If the indexer crashes, >it doesn’t affect Solr either. +1 for what can go wrong:

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Phil Scadden

: ZiYuan [mailto:ziyu...@gmail.com] Sent: Tuesday, 20 June 2017 11:29 p.m. To: solr-user@lucene.apache.org Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context Dear Erick and Timothy, I also took a look at the Python clients (say, SolrClient

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.

Yeah, Chris knows a thing or two about Tika. :) -Original Message- From: ZiYuan [mailto:ziyu...@gmail.com] Sent: Tuesday, June 20, 2017 8:00 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan

No intention of spamming but I also want to mention tika-python in the toolchain. Ziyuan On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan wrote: > Dear Erick and Timothy, > > I also took a look at the Python clients (say, SolrClient and

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan

Dear Erick and Timothy, I also took a look at the Python clients (say, SolrClient and pysolr) because Python is my main programming language. I have an impression that 1. they send HTTP requests to the server according to the server APIs; 2. they are not official and thus possibly not up to date.

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan

Dear Erick and Timothy, yes I will parse from the client for all the benefits. I am just trying to figure out what is going on by indexing one or two PDF files first. Thank you both. Best regards, Ziyuan On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson wrote: > bq:

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erick Erickson

bq: Hope that there is no side effect of not mapping the PDF Well, yes it will have that side effect. You can cure that with a copyField directive from content to _text_. But do really consider running this as a SolrJ program on the client. Tim knows in far more painful detail than I do what
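The copyField suggestion above can also be expressed as a Schema API command instead of a hand-edited schema file. The following is a small sketch that only builds the request body and URL; it assumes a managed schema, and the base URL and collection name used in the usage note are placeholders, not values from the thread.

```python
import json


def copy_field_command(source: str, dest: str) -> str:
    """Return the JSON body for a Schema API add-copy-field command."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dest}})


def schema_url(base_url: str, collection: str) -> str:
    """Return the Schema API endpoint for a collection (names are placeholders)."""
    return f"{base_url.rstrip('/')}/{collection}/schema"
```

For the case discussed here, `copy_field_command("content", "_text_")` would be POSTed with `Content-Type: application/json` to something like `schema_url("http://localhost:8983/solr", "mycollection")`.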

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan

Hi Erick, Now it is clear. I have to update the request handler of /update/extract/ from "defaults":{"fmap.content":"_text_"} to "defaults":{"fmap.content":"content"} to fill the field. Hope that there is no side effect of not mapping the PDF content to _text_. Thank you for the hint. Best

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Allison, Timothy B.

Finally, and I mean it this time, I heartily second Erik's point about SolrJ and the need to keep your file processing outside of Solr's JVM, VM and M! -Original Message- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Monday, June 19, 2017 6:56 AM To: solr-user@lucene.apache.org Subj

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erik Hatcher

Ziyuan - You may be interested in the example/files that ships with Solr too. It’s got schema and config and even UI for file indexing and searching. Check out README.txt under example/files in your Solr install. Erik > On Jun 19, 2017, at 6:52 AM, ZiYuan

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan

Hi Erick, thanks very much for the explanations! Clarification for question 2: more specifically I cannot see the field content in the returned JSON, with the same definitions as in the post

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-18 Thread Erick Erickson

1> Yes, you can use your single definition. The author identifies the "text" field as a catch-all. Somewhere in the schema there'll be a copyField directive copying (perhaps) many different fields to the "text" field. That permits simple searches against a single field rather than, say, using

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-17 Thread ZiYuan

Hi, I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context. Now I am reading this post

Re: indexing pdf files using post tool

2016-03-19 Thread Francisco Andrés Fernández

erent fields in a document of solr according to data in it > like name;id;title;content etc > > Thanks > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html > Sent from the Solr - User mailing list archive at Nabble.com. >

Re: indexing pdf files using post tool

2016-03-19 Thread Binoy Dalal

of solr according to data >> in it >> > like name;id;title;content etc >> > >> > Thanks >> > >> > >> > >> > -- >> > View this message in context: >> > >> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html >> > Sent from the Solr - User mailing list archive at Nabble.com. >> > >> > -- > Regards, > Binoy Dalal > -- Regards, Binoy Dalal

Re: indexing pdf files using post tool

2016-03-18 Thread Binoy Dalal

ording to data in > it > > like name;id;title;content etc > > > > Thanks > > > > > > > > -- > > View this message in context: > > > http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html > > Sent from the Solr - User mailing list archive at Nabble.com. > > > -- Regards, Binoy Dalal

Re: indexing pdf files using post tool

2016-03-18 Thread Jan Høydahl

a document of solr according to data in it > like name;id;title;content etc > > Thanks > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html > Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing pdf files using post tool

2016-03-16 Thread vidya

Sorry for conveying it in wrong way. I want my data of 1 pdf file to be indexed with different fields in a document of solr according to data in it like name;id;title;content etc Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool

Re: indexing pdf files using post tool

2016-03-15 Thread roshan agarwal

> > -- > View this message in context: > http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Roshan Agarwal Managing Director Siddhast IP Innovation (P) Ltd Phone: +

Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal

.How can I achieve this ? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Regards, Binoy Dalal

Re: indexing pdf files using post tool

2016-03-15 Thread vidya

Hi, I got data into my content field. But I wanted to have different fields allocated for the data in my file. How can I achieve this? -- View this message in context: http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html Sent from the Solr - User mailing

Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal

289 > ], > "dc_format": [ > "application/pdf; version=1.4" > ], > "producer": [ > "wkhtmltopdf" > ], > "content_type": [ > "application/pdf" >

indexing pdf files using post tool

2016-03-15 Thread vidya

t_type": [ "application/pdf" ], "xmp_creatortool": [ "þÿ" ], "resourcename": [ "/root/solr/My_CV.pdf" ], "dc_title": [ "My CV"

indexing pdf binary stored in mongodb?

2016-02-05 Thread Arnett, Gabriel

Anyone have any experience indexing pdfs stored in binary form in mongodb? . Gabe Arnett Senior Director Moody's Analytics - The information contained in this e-mail message, and any attachment thereto, is

Re: indexing pdf binary stored in mongodb?

2016-02-05 Thread Jack Krupansky

See if they are stored in BSON format using GridFS. If so, you can simply use the mongofiles command to retrieve the PDF into a local file and index that in Solr either using Solr Cell or Tika. See: http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb

Re: solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Binoy Dalal

Do you see any exceptions in the solr log? On Sat, 23 Jan 2016, 16:29 Moncif Aidi wrote: > HI, > > I have a problem with integrating solr in Ubuntu server.Before using solr > on ubuntu server i tested it on my mac it was working perfectly. it indexed > my PDF,Doc,Docx

solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Moncif Aidi

Hi, I have a problem with integrating Solr on an Ubuntu server. Before using Solr on the Ubuntu server I tested it on my Mac, where it was working perfectly: it indexed my PDF, Doc and Docx documents. After installing Solr on the Ubuntu server and using the same configuration files and libraries, I've found out that

Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo

Thanks for all your replies. I did chance upon this question on Stack Overflow, which is said to solve the issue: http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/ However, when I tried to run it, I still get the same "?" output in the content, the

Re: Issues when indexing PDF files

2015-12-18 Thread Erick Erickson

This could also simply be your browser isn't set up to display UTF-8, the characters may be just fine. Best, Erick On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo wrote: > Thanks for all your replies. > > I did chance upon this question from stackoverflow which it

Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo

Hi Erick, Thanks for your reply. However, it is unlikely to be the browser issue, as the same result occurs when I tried it in the Tika app. Regards, Edwin On 18 December 2015 at 23:39, Erick Erickson wrote: > This could also simply be your browser isn't set up to >

Re: Issues when indexing PDF files

2015-12-17 Thread Zheng Lin Edwin Yeo

Hi Alexandre, Thanks for your reply. So the only way to solve this issue is to explore with PDF specific tools and change the encoding of the file? Is there any way to configure it in Solr? Regards, Edwin On 17 December 2015 at 15:42, Alexandre Rafalovitch wrote: > They

Re: Issues when indexing PDF files

2015-12-17 Thread Binoy Dalal

You can always write an update handler plugin to convert your PDFs to utf-8 and then push them to solr On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo wrote: > Hi Alexandre, > > Thanks for your reply. > > So the only way to solve this issue is to explore with PDF specific

RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.

17, 2015 5:48 AM To: solr-user@lucene.apache.org Subject: Re: Issues when indexing PDF files On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote: > Hi Alexandre, > > Thanks for your reply. > > So the only way to solve this issue is to explore with PDF specific > tools and change the en

Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood

PDF isn’t really text. For example, it doesn’t have spaces, it just moves the next letter over farther. Letters might not be in reading order — two column text could be printed as horizontal scans. Custom fonts might not use an encoding that matches Unicode, which makes them encrypted (badly).

Re: Issues when indexing PDF files

2015-12-17 Thread Charlie Hull

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote: Hi Alexandre, Thanks for your reply. So the only way to solve this issue is to explore with PDF specific tools and change the encoding of the file? Is there any way to configure it in Solr? Solr uses Tika to extract plain text from PDFs. If the

Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo

I've checked all the files which have problems with the content in the Solr index using the Tika app. All of them show the same issues as what I see in the Solr index. So does the issue lie with the encoding of the file? Are we able to check the encoding of the file? Regards, Edwin On 17

Re: Issues when indexing PDF files

2015-12-16 Thread Alexandre Rafalovitch

They could be using custom fonts and non-Unicode characters. That's probably something to explore with PDF specific tools. On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" wrote: > I've checked all the files which has problem with the content in the Solr > index using the Tika

Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo

Hi, I'm using Solr 5.3.0. I'm indexing some PDF documents. However, certain PDF files contain Chinese text, but after indexing, what is indexed in the content is either a series of "??" or empty content. I'm using the post.jar that comes together with Solr. What

Re: Issues when indexing PDF files

2015-12-16 Thread Erik Hatcher

Edwin - Can you share one of those PDF files? Also, drop the file into the Tika app and see what it sees directly - get the tika-app JAR and run that desktop application. Could be an encoding issue? Erik — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com

Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo

Hi Erik, I've shared the file on dropbox, which you can access via the link here: https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0 This is what I get from the Tika app after dropping the file in. Content-Length: 75092 Content-Type: application/pdf Type: COSName{Info}

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.

Reddy [mailto:vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:10 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Erick, I tried indexing both ways - SolrJ / Tika's AutoParser and as well as SolrCell's ExtractRequestHandler. Majority of the PDF

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.

: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Thanks Allison. I tried with the mentioned changes. But still no luck. I am using the code from lucidworks site provided by Erick and now included the changes mentioned by you

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.

+1 :) PS: one more thing - please, tell your management that you will never ever successfully all real-world PDFs and cater for that fact in your requirements :-)

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy

Erick, I tried indexing both ways - SolrJ / Tika's AutoParser as well as SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either a PDFParser or OfficeParser error. Not sure if

Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl

Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) If you start command line tools from your JVM please have a look
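The fallback idea in this message can be sketched as a wrapper that tries the primary parser first and shells out to `pdftotext` only when that fails. This is an illustrative sketch, not the poster's setup: `primary` is a hypothetical callable for the main parser, the 60-second timeout is an arbitrary assumption, and `pdftotext` is assumed to be on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path: str, layout: bool = False) -> list:
    """Build the pdftotext command line; a trailing "-" writes text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_with_fallback(pdf_path: str, primary, timeout_s: int = 60) -> str:
    """Try the primary parser; on any failure fall back to the pdftotext CLI."""
    try:
        return primary(pdf_path)
    except Exception:
        # The timeout guards against an external tool hanging on a bad file.
        result = subprocess.run(
            pdftotext_command(pdf_path),
            capture_output=True,
            timeout=timeout_s,
            check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")
```

Running the external tool with a timeout and `check=True` keeps one problematic PDF from stalling or silently corrupting the rest of the indexing run.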

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy

[mailto: vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:10 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Erick, I tried indexing both ways - SolrJ / Tika's AutoParser and as well as SolrCell's ExtractRequestHandler. Majority of the PDF

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy

are always working to improve it. Best, Tim -Original Message- From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Thanks

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]

and httpd, at least to me. -Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, April 16, 2015 7:53 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Hi Vijay, I know the this road too well :-) For PDF you can fallback

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]

@lucene.apache.org Subject: RE: Indexing PDF and MS Office files +1 :) PS: one more thing - please, tell your management that you will never ever successfully all real-world PDFs and cater for that fact in your requirements :-)

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy

, but we are always working to improve it. Best, Tim -Original Message- From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull

On 16/04/2015 12:53, Siegfried Goeschl wrote: Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) Here's some file

Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood

Turning PDF back into a structured document is like trying to turn hamburger back into a cow. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. talli...@mitre.org wrote: +1 :) PS: one more thing -

Re: Indexing PDF and MS Office files

2015-04-15 Thread Erick Erickson

There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work on to Solr, in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't

Re: Indexing PDF and MS Office files

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy

Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and PDF functionality is working fine. However, the error with some MS office word documents still persist. The error message is

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy

Hi, Here are the solr-config xml and the error log from Solr logs for your reference. As mentioned earlier, I didnt make any changes to the solr-config.xml as I am using the xml file out of the box one that came with the default installation. Please let me know your thoughts on why these issues

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy

Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the Parsing error below.

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

It seems something like https://issues.apache.org/jira/browse/TIKA-1251. I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seems to be fixed in Tika 1.6. I agree with Erik: you should try with another version of Tika. Best, Andrea On 04/14/2015 06:44 PM, Vijaya Narayana Reddy

Re: Indexing PDF and MS Office files

2015-04-14 Thread Erick Erickson

looks like this is just a file that Tika can't handle, based on this line: bq: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser You might be able to get some joy from parsing this from Java and see if a more recent Tika would

Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy

Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following issues. Please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 and the default example server

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi Vijay, Please paste an extract of your schema, where the content field (the field where the PDF text should be) and its type are declared. For the other issue, please paste the whole stacktrace because org.apache.tika.parser.microsoft.OfficeParser* says nothing. The complete stacktrace (or

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi, solrconfig.xml (especially if you didn't touch it) should be good. What about the schema? Are you using the one that comes with the download bundle, too? I don't see the stacktrace..did you forget to paste it? Best, Andrea On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy

Re: Indexing PDF and MS Office files

2015-04-14 Thread Jack Krupansky

Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content
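A manual extraction request of the kind suggested here can be put together as below. This sketch only builds the URL for the `/update/extract` handler with `extractOnly=true`; the base URL and collection name are placeholders, and `extractFormat=text` and `wt=json` are optional choices made for readability, not something the message prescribes.

```python
from urllib.parse import urlencode


def extract_only_url(base_url: str, collection: str, extract_format: str = "text") -> str:
    """Build a Solr Cell URL that returns extracted content without indexing it."""
    params = urlencode(
        {"extractOnly": "true", "extractFormat": extract_format, "wt": "json"}
    )
    return f"{base_url.rstrip('/')}/{collection}/update/extract?{params}"
```

The PDF itself would then be sent as the body of a POST to that URL (for example with curl's `-F` option); if the response contains no text, the problem lies in extraction rather than in the schema or field mapping.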

Re: Indexing PDF and MS Office files

2015-04-14 Thread Shyam R

Vijay, You could try different excel files with different formats to rule out the issue is with TIKA version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote: Perhaps the PDF is protected and the content can not be extracted? i have an

Re: Indexing PDF and MS Office files

2015-04-14 Thread Terry Rhodes

Perhaps the PDF is protected and the content can not be extracted? i have an unverified suspicion that the tika shipped with solr 4.10.2 may not support some/all office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to

Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-14 Thread vignesh

Dear Team, I am Vignesh using the latest version 4.8.0 Apache Solr and am Indexing my PDF but getting an error and have posted that below for your reference. Kindly guide me to solve this error. D:\IPCB\solrjava -Durl=http://localhost:8082/solr/ipcb/update/extract -Dparams=

Re: Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-12 Thread Siegfried Goeschl

Hi Vignesh, can you check your SOLR Server Log?! Not all PDF documents on this planet can be processed using Tika :-) Cheers, Siegfried Goeschl On 07 May 2014, at 09:40, vignesh vignes...@ninestars.in wrote: Dear Team, I am Vignesh using the latest version 4.8.0 Apache Solr

Re: Indexing pdf files - question.

2013-09-08 Thread Nutan Shinde

Error got resolved, solution was dynamic field / must be within fields tag. On Sun, Sep 8, 2013 at 3:31 AM, Furkan KAMACI furkankam...@gmail.com wrote: Could you show us logs you get when you start your web container? 2013/9/4 Nutan Shinde nutanshinde1...@gmail.com My solrconfig.xml is:

Re: Indexing pdf files - question.

2013-09-07 Thread Furkan KAMACI

Could you show us logs you get when you start your web container? 2013/9/4 Nutan Shinde nutanshinde1...@gmail.com My solrconfig.xml is: requestHandler name=/update/extract class=solr.extraction.ExtractingRequestHandler lst name=defaults str name=fmap.contentdesc/str !-to map this

Re: Indexing pdf files - question.

2013-09-04 Thread Nutan Shinde

My solrconfig.xml is: requestHandler name=/update/extract class=solr.extraction.ExtractingRequestHandler lst name=defaults str name=fmap.contentdesc/str !-to map this field of my table which is defined as shown below in schem.xml-- str name=lowernamestrue/str str name=uprefixattr_/str

Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112

Can you please suggest a way (with example) of assigning this unique key to a pdf file? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074588.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112

Okay. Can you please suggest a way (with an example) of assigning this unique key to a pdf file. Say, a unique number to each pdf file. How do i achieve this? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html

Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar

) of assigning this unique key to a pdf file. Say, a unique number to each pdf file. How do i achieve this? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html Sent from the Solr - User mailing list archive

Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112

Yes. The absolute path is unique. -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074620.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112

Yes. The absolute path is unique. How do I implement it? Can you please explain? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar

archit2...@gmail.com wrote: Yes. The absolute path is unique. How do i implement it? can you please explain? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html Sent from the Solr - User mailing list archive
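
For readers landing on this thread: since the absolute path is said to be unique, one way to turn it into a unique key can be sketched as below. This is only an illustration, not what the posters actually used. The function names `pdf_doc_id` and `pdf_doc_id_hashed` are made up for this example, and how the value reaches the Solr `id` field (data import handler, extract handler, or a client program) is left out because the thread does not show it.

```python
import hashlib
from pathlib import Path


def pdf_doc_id(path: str) -> str:
    """Return the absolute, normalized path of a file as its id.

    Hypothetical helper: two spellings of the same file
    ("a.pdf" and "./a.pdf") resolve to the same id.
    """
    return str(Path(path).resolve())


def pdf_doc_id_hashed(path: str) -> str:
    """Return a fixed-length SHA-1 hex digest of the absolute path.

    Optional variant (an assumption, not from the thread) for cases
    where long paths or special characters are awkward as ids.
    """
    return hashlib.sha1(pdf_doc_id(path).encode("utf-8")).hexdigest()
```

Either value stays the same each time the same file is indexed, so re-indexing a PDF would overwrite its earlier document rather than add a second one, provided the file has not moved.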

Unique key error while indexing pdf files

2013-07-01 Thread archit2112

of a document. how do i define the id field (unique key) of a pdf file. how do i solve this problem? Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html Sent from the Solr - User mailing list archive

Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky

Krupansky -Original Message- From: archit2112 Sent: Monday, July 01, 2013 7:17 AM To: solr-user@lucene.apache.org Subject: Unique key error while indexing pdf files Hi, I'm trying to index PDF files in Solr 4.3.0 using the data import handler. My request handler: requestHandler name

Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112

this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky

AM To: solr-user@lucene.apache.org Subject: Re: Unique key error while indexing pdf files I'm new to Solr. I'm just trying to understand and explore various features offered by Solr and their implementations. I would be very grateful if you could solve my problem with any example of your choice. I

Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-10 Thread Mark Wilson

Hi Michael Thanks very much for that, it did indeed solve the problem. I had it setup on my internal servers, as I have a separate script for tomcat startup, but forgot all about it on the Amazon Cloud servers. For info I added CATALINA_OPTS=-Djava.awt.headless=true export CATALINA_OPTS to
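
A setting like this is easy to forget when there are several servers with separate startup scripts, so a small helper that adds the flag when building the options string may be useful. The sketch below is an assumption-laden illustration: `with_headless` is a hypothetical name, and it splits on whitespace, so it would not handle option values that themselves contain spaces.

```python
HEADLESS_FLAG = "-Djava.awt.headless=true"


def with_headless(catalina_opts: str) -> str:
    """Return the options string with AWT headless mode switched on.

    Hypothetical helper: drops any existing java.awt.headless setting
    and appends the flag once, so calling it again changes nothing.
    """
    options = [
        opt
        for opt in catalina_opts.split()
        if not opt.startswith("-Djava.awt.headless=")
    ]
    options.append(HEADLESS_FLAG)
    return " ".join(options)
```

The result is the value that would be assigned to CATALINA_OPTS and exported before Tomcat starts, as described in the message above.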

Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-10 Thread Michael Della Bitta

Glad that helped. I'm going to go buy a lottery ticket now! :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+:

Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Mark Wilson

Hi I am having an issue with adding pdf documents to a SolrCloud index I have setup. I can index pdf documents fine using 4.3.0 on my local box, but I have a SolrCloud instance setup on the Amazon Cloud (Using 2 servers) and I get Error. It seems that it is not loading

Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Michael Della Bitta

Hi Mark, This is a total shot in the dark, but does passing -Djava.awt.headless=true when you run the server help at all? More on awt headless mode: http://www.oracle.com/technetwork/articles/javase/headless-136834.html Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1
