Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick...

On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson 
wrote:

> https://lucidworks.com/post/indexing-with-solrj/
>
>
> > On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> >
> > Thanks Jorn and Erick.
> >
> > Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> >
> > Thanks
> > Fiz
> >
> > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> > wrote:
> >
> >> Here’s a skeletal SolrJ program using Tika as another alternative.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >>>
> >>> You have to write an external application that creates multiple
> threads,
> >> parses the PDFs and index them in Solr. Ideally you parse the PDFs once
> and
> >> store the resulting text on some file system and then index it. Reason
> is
> >> that if you upgrade to two major versions of Solr you might need to
> reindex
> >> again. Then you can save time because you don’t need to parse the PDFs
> >> again.
> >>> It can be also useful in case you are not sure yet about the final
> >> schema and need to index several times in different schemas etc
> >>>
> >>> You can also use Apache manifoldCF.
> >>>
> >>>
> >>>
>  Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
>  Hello SOLR Experts,
> 
>  I am working on a POC to Index millions of PDF documents present in
>  Multiple Folder in fileshare.
> 
>  Could you please let me the best practices and step to implement it.
> 
>  Thanks
>  Fiz Nadiyal.
> >>
> >>
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/


> On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> 
> Thanks Jorn and Erick.
> 
> Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> 
> Thanks
> Fiz
> 
> On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> wrote:
> 
>> Here’s a skeletal SolrJ program using Tika as another alternative.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
>>> 
>>> You have to write an external application that creates multiple threads,
>> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
>> store the resulting text on some file system and then index it. Reason is
>> that if you upgrade to two major versions of Solr you might need to reindex
>> again. Then you can save time because you don’t need to parse the PDFs
>> again.
>>> It can be also useful in case you are not sure yet about the final
>> schema and need to index several times in different schemas etc
>>> 
>>> You can also use Apache manifoldCF.
>>> 
>>> 
>>> 
 Am 07.06.2020 um 19:19 schrieb Fiz N :
 
 Hello SOLR Experts,
 
 I am working on a POC to Index millions of PDF documents present in
 Multiple Folder in fileshare.
 
 Could you please let me the best practices and step to implement it.
 
 Thanks
 Fiz Nadiyal.
>> 
>> 



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick.

Hi Erick, looks like the skeletal SOLRJ program attachment is missing.

Thanks
Fiz

On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
wrote:

> Here’s a skeletal SolrJ program using Tika as another alternative.
>
> Best,
> Erick
>
> > On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >
> > You have to write an external application that creates multiple threads,
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
> store the resulting text on some file system and then index it. Reason is
> that if you upgrade to two major versions of Solr you might need to reindex
> again. Then you can save time because you don’t need to parse the PDFs
> again.
> > It can be also useful in case you are not sure yet about the final
> schema and need to index several times in different schemas etc
> >
> > You can also use Apache manifoldCF.
> >
> >
> >
> >> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>
> >> Hello SOLR Experts,
> >>
> >> I am working on a POC to Index millions of PDF documents present in
> >> Multiple Folder in fileshare.
> >>
> >> Could you please let me the best practices and step to implement it.
> >>
> >> Thanks
> >> Fiz Nadiyal.
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative.

Best,
Erick

> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> 
> You have to write an external application that creates multiple threads, 
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and 
> store the resulting text on some file system and then index it. Reason is 
> that if you upgrade to two major versions of Solr you might need to reindex 
> again. Then you can save time because you don’t need to parse the PDFs again. 
> It can be also useful in case you are not sure yet about the final schema and 
> need to index several times in different schemas etc
> 
> You can also use Apache manifoldCF.
> 
> 
> 
>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>> 
>> Hello SOLR Experts,
>> 
>> I am working on a POC to Index millions of PDF documents present in
>> Multiple Folder in fileshare.
>> 
>> Could you please let me the best practices and step to implement it.
>> 
>> Thanks
>> Fiz Nadiyal.



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses 
the PDFs and index them in Solr. Ideally you parse the PDFs once and store the 
resulting text on some file system and then index it. Reason is that if you 
upgrade to two major versions of Solr you might need to reindex again. Then you 
can save time because you don’t need to parse the PDFs again. 
It can be also useful in case you are not sure yet about the final schema and 
need to index several times in different schemas etc

You can also use Apache manifoldCF.



> Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
> Hello SOLR Experts,
> 
> I am working on a POC to Index millions of PDF documents present in
> Multiple Folder in fileshare.
> 
> Could you please let me the best practices and step to implement it.
> 
> Thanks
> Fiz Nadiyal.


Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Hello SOLR Experts,

I am working on a POC to Index millions of PDF documents present in
Multiple Folder in fileshare.

Could you please let me the best practices and step to implement it.

Thanks
Fiz Nadiyal.


Re: Problem with SolrJ and indexing PDF files

2019-05-19 Thread Erick Erickson
Here’s a skeletal program to get you started using Tika directly in a SolrJ 
client, with a long explication of why using Solr’s extracting request handler 
is probably not what you want to do in production: 

https://lucidworks.com/2012/02/14/indexing-with-solrj/

SolrServer was renamed SolrClient 4 1/2 years ago, one of my pet peeves is that 
lots of pages don’t have dates attached. The link above was updated after this 
change even though it was published in 2012, but even so you’ll find some 
methods that have since been deprecated.

If you’re using SolrCloud, you should be using CloudSolrClient rather than 
SolrClient.

Best,
Erick

> On May 19, 2019, at 5:07 AM, Jörn Franke  wrote:
> 
> You can use the Tika library to parse the PDFs and then post the text to the 
> Solr servers
> 
>> Am 19.05.2019 um 11:02 schrieb Mareike Glock 
>> :
>> 
>> Dear Solr Team,
>> 
>> I am trying to index Word and PDF documents with Solr using SolrJ, but most 
>> of the examples I found on the internet use the SolrServer class which I 
>> guess is deprecated. 
>> The connection to Solr itself is working, because I can add 
>> SolrInputDocuments to the index but it does not work for rich documents 
>> because I get an exception.
>> 
>> 
>> public static void main(String[] args) throws IOException, 
>> SolrServerException {
>>String urlString = "http://localhost:8983/solr/localDocs16;;
>>HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();
>> 
>>//is working
>>for(int i=0;i<1000;++i) {
>>SolrInputDocument doc = new SolrInputDocument();
>>doc.addField("cat", "book");
>>doc.addField("id", "book-" + i);
>>doc.addField("name", "The Legend of the Hobbit part " + i);
>>solr.add(doc);
>>if(i%100==0) solr.commit();  // periodically flush
>>}
>> 
>>//is not working
>>File file = new File("path\\testfile.pdf");
>> 
>>ContentStreamUpdateRequest req = new 
>> ContentStreamUpdateRequest("update/extract");
>> 
>>req.addFile(file, "application/pdf");
>>req.setParam("literal.id", "doc1");
>>req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>try{
>>solr.request(req);
>>}
>>catch(IOException e){
>>PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>e.printStackTrace(out);
>>out.close();
>>System.out.println("IO message: " + e.getMessage());
>>} catch(SolrServerException e){
>>PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>e.printStackTrace(out);
>>out.close();
>>System.out.println("SolrServer message: " + e.getMessage());
>>} catch(Exception e){
>>PrintWriter out = new 
>> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
>>e.printStackTrace(out);
>>out.close();
>>System.out.println("UnknownException message: " + e.getMessage());
>>}finally{
>>solr.commit();
>>}
>> }
>> 
>> 
>> I am using Maven (pom.xml attached) and created a JAR file, which I then 
>> tried to execute from the command line, and this is the output I get:
>> 
>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>> details.
>>SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
>>SLF4J: Defaulting to no-operation MDCAdapter implementation.
>>SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
>> further details.
>>message: UnknownException message: Error from server at 
>> http://localhost:8983/solr/localDocs17: Bad contentType for search handler 
>> :application/pdf request={wt=javabin=2}
>> 
>> 
>> 
>> 
>> 
>> I hope you may be able to help me with this. I also posted this issue on 
>> Github.
>> 
>> Cheers,
>> Mareike Glock
>> 
>> 



Re: Problem with SolrJ and indexing PDF files

2019-05-19 Thread Jörn Franke
You can use the Tika library to parse the PDFs and then post the text to the 
Solr servers

> Am 19.05.2019 um 11:02 schrieb Mareike Glock 
> :
> 
> Dear Solr Team,
> 
> I am trying to index Word and PDF documents with Solr using SolrJ, but most 
> of the examples I found on the internet use the SolrServer class which I 
> guess is deprecated. 
> The connection to Solr itself is working, because I can add 
> SolrInputDocuments to the index but it does not work for rich documents 
> because I get an exception.
> 
> 
> public static void main(String[] args) throws IOException, 
> SolrServerException {
> String urlString = "http://localhost:8983/solr/localDocs16;;
> HttpSolrClient solr = new HttpSolrClient.Builder(urlString).build();
> 
> //is working
> for(int i=0;i<1000;++i) {
> SolrInputDocument doc = new SolrInputDocument();
> doc.addField("cat", "book");
> doc.addField("id", "book-" + i);
> doc.addField("name", "The Legend of the Hobbit part " + i);
> solr.add(doc);
> if(i%100==0) solr.commit();  // periodically flush
> }
> 
> //is not working
> File file = new File("path\\testfile.pdf");
> 
> ContentStreamUpdateRequest req = new 
> ContentStreamUpdateRequest("update/extract");
> 
> req.addFile(file, "application/pdf");
> req.setParam("literal.id", "doc1");
> req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> try{
> solr.request(req);
> }
> catch(IOException e){
> PrintWriter out = new 
> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
> e.printStackTrace(out);
> out.close();
> System.out.println("IO message: " + e.getMessage());
> } catch(SolrServerException e){
> PrintWriter out = new 
> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
> e.printStackTrace(out);
> out.close();
> System.out.println("SolrServer message: " + e.getMessage());
> } catch(Exception e){
> PrintWriter out = new 
> PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");
> e.printStackTrace(out);
> out.close();
> System.out.println("UnknownException message: " + e.getMessage());
> }finally{
> solr.commit();
> }
> }
> 
> 
> I am using Maven (pom.xml attached) and created a JAR file, which I then 
> tried to execute from the command line, and this is the output I get:
> 
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
> SLF4J: Defaulting to no-operation MDCAdapter implementation.
> SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
> further details.
> message: UnknownException message: Error from server at 
> http://localhost:8983/solr/localDocs17: Bad contentType for search handler 
> :application/pdf request={wt=javabin=2}
> 
> 
> 
> 
> 
> I hope you may be able to help me with this. I also posted this issue on 
> Github.
> 
> Cheers,
> Mareike Glock
> 
> 


Problem with SolrJ and indexing PDF files

2019-05-19 Thread Mareike Glock

Dear Solr Team,

I am trying to index Word and PDF documents with Solr using SolrJ, but 
most of the examples I found on the internet use the SolrServer class 
which I guess is deprecated.
The connection to Solr itself is working, because I can add 
SolrInputDocuments to the index but it does not work for rich documents 
because I get an exception.



public static void main(String[] args) throws IOException, 
SolrServerException {

String urlString = "http://localhost:8983/solr/localDocs16;;
HttpSolrClient solr = new 
HttpSolrClient.Builder(urlString).build();


//is working
for(int i=0;i<1000;++i) {
SolrInputDocument doc = new SolrInputDocument();
doc.addField("cat", "book");
doc.addField("id", "book-" + i);
doc.addField("name", "The Legend of the Hobbit part " + i);
solr.add(doc);
if(i%100==0) solr.commit();  // periodically flush
}

//is not working
File file = new File("path\\testfile.pdf");

ContentStreamUpdateRequest req = new 
ContentStreamUpdateRequest("update/extract");


req.addFile(file, "application/pdf");
req.setParam("literal.id", "doc1");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
try{
solr.request(req);
}
catch(IOException e){
PrintWriter out = new 
PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");

e.printStackTrace(out);
out.close();
System.out.println("IO message: " + e.getMessage());
} catch(SolrServerException e){
PrintWriter out = new 
PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");

e.printStackTrace(out);
out.close();
System.out.println("SolrServer message: " + e.getMessage());
} catch(Exception e){
PrintWriter out = new 
PrintWriter("C:\\Users\\mareike\\Desktop\\filename.txt");

e.printStackTrace(out);
out.close();
System.out.println("UnknownException message: " + 
e.getMessage());

}finally{
solr.commit();
}
}


I am using Maven (pom.xml attached) and created a JAR file, which I then 
tried to execute from the command line, and this is the output I get:


SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for 
further details.

SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
further details.
message: *UnknownException message: Error from server at 
http://localhost:8983/solr/localDocs17: Bad contentType for search 
handler :application/pdf request={wt=javabin=2}*




I hope you may be able to help me with this. I also posted this issue on 
Github 
.


Cheers,
Mareike Glock

http://maven.apache.org/POM/4.0.0; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd;>
  4.0.0
  com.mycompany.app
  solr-search
  jar
  1.0
  solr-search
  http://maven.apache.org

  
1.7
1.7
  

  

  
maven-assembly-plugin

  

  com.mycompany.app.Main

  
  
jar-with-dependencies
  

  

  

  

  junit
  junit
  3.8.1
  test


  org.apache.solr
  solr-solrj
  7.7.0

  




Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga
Yes, I know the reasons why put this work on a client rather than use Solr
directly and it should be maybe the next my task.
But I need to finish first my task - index a pdf files stored in SqlBase
database. The pdf files are pretty simple, sometimes only dozens text lines.

Regards,

Aruna

On Wed, Apr 3, 2019 at 5:03 PM Erick Erickson 
wrote:

> For a lot of reasons, I greatly prefer to put this work on a client rather
> than use Solr directly. Here’s a place to get started, it connects to a DB
> and also scans local file directory for docs to push through (local) Tika
> and index. So you should be able to modify it relatively easily to get the
> data from SqlBase, read the associated PDF, combine the two and send to
> Solr.
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> The code itself is a bit old, but illustrates the process.
>
> Best,
> Erick
>
> > On Apr 2, 2019, at 11:46 PM, Arunas Spurga  wrote:
> >
> > Hello,
> >
> > I got a task to index in Solr 7.71 a PDF files which are stored in
> SqlBase
> > database. I did half the job - I can to index all table fields, I can do
> a
> > search in these fields except field in which is stored a pdf file
> content.
> > As I am ttotally new in Solr, spent unsuccessfully a lot a time trying to
> > understand how to force to extract and index field with pdf content. I
> need
> > a help.
> >
> > Regards,
> >
> > Aruna
> >
> > in solrconfig.xml i have
> >
> >
> > *  dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib"
> > regex=".*\.jar" />   > regex="solr-dataimporthandler-.*\.jar" /> *
> > *   > regex=".*\.jar" />*
> > *   > regex="solr-cell-\d.*\.jar" />*
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > * > startup="lazy"
> > class="solr.extraction.ExtractingRequestHandler" > > name="defaults">  true   > name="fmap.meta">ignored_   > name="fmap.content">_text_  *
> >
> >
> >
> >
> >
> > * > class="org.apache.solr.handler.dataimport.DataImportHandler">> name="defaults">db-data-config.xml   
> > *
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> *-db-data-config.xml > type="JdbcDataSource"
> > driver="jdbc.unify.sqlbase.SqlbaseDriver"
> > url="jdbc:sqlbase://localhost:2155/PDFDOCS"
> > user="sysadm"password="sysadm" />  > name="PDFDOCUMENTS" query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
> >  > name="PDF" />
> > *
>
>


Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Erick Erickson
For a lot of reasons, I greatly prefer to put this work on a client rather than 
use Solr directly. Here’s a place to get started, it connects to a DB and also 
scans local file directory for docs to push through (local) Tika and index. So 
you should be able to modify it relatively easily to get the data from SqlBase, 
read the associated PDF, combine the two and send to Solr.

https://lucidworks.com/2012/02/14/indexing-with-solrj/

The code itself is a bit old, but illustrates the process.

Best,
Erick

> On Apr 2, 2019, at 11:46 PM, Arunas Spurga  wrote:
> 
> Hello,
> 
> I got a task to index in Solr 7.71 a PDF files which are stored in SqlBase
> database. I did half the job - I can to index all table fields, I can do a
> search in these fields except field in which is stored a pdf file content.
> As I am ttotally new in Solr, spent unsuccessfully a lot a time trying to
> understand how to force to extract and index field with pdf content. I need
> a help.
> 
> Regards,
> 
> Aruna
> 
> in solrconfig.xml i have
> 
> 
> *  regex=".*\.jar" />   regex="solr-dataimporthandler-.*\.jar" /> *
> *   regex=".*\.jar" />*
> *   regex="solr-cell-\d.*\.jar" />*
> 
> 
> 
> 
> 
> 
> 
> 
> 
> * startup="lazy"
> class="solr.extraction.ExtractingRequestHandler" > name="defaults">  true   name="fmap.meta">ignored_   name="fmap.content">_text_  *
> 
> 
> 
> 
> 
> * class="org.apache.solr.handler.dataimport.DataImportHandler">name="defaults">db-data-config.xml   
> *
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *-db-data-config.xml type="JdbcDataSource"
> driver="jdbc.unify.sqlbase.SqlbaseDriver"
> url="jdbc:sqlbase://localhost:2155/PDFDOCS"
> user="sysadm"password="sysadm" />  name="PDFDOCUMENTS" query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
>  name="PDF" />
> *



Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga
Hello,

I got a task to index in Solr 7.71 a PDF files which are stored in SqlBase
database. I did half the job - I can to index all table fields, I can do a
search in these fields except field in which is stored a pdf file content.
As I am ttotally new in Solr, spent unsuccessfully a lot a time trying to
understand how to force to extract and index field with pdf content. I need
a help.

Regards,

Aruna

in solrconfig.xml i have


**
*  *
*  *









*  true  ignored_  _text_  *





*   db-data-config.xml   
*



















*-db-data-config.xml 
 
*


RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
I will second the SolrJ method. You don’t want to be doing this on your SOLR 
instance. One question is whether your PDFs are scanned or are already 
searchable. I use tesseract offline to convert all scanned PDFs into searchable 
PDF so I don’t want Tika to be doing that. My code core is:
File f = new File(filename);
 ContentHandler textHandler = new 
BodyContentHandler(Integer.MAX_VALUE);
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ParseContext context = new ParseContext();
 if (filename.toLowerCase().contains("pdf")) {
   PDFParserConfig pdfConfig = new PDFParserConfig();
   pdfConfig.setExtractInlineImages(false);
   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR); 
// Remove this line (in fact remove the whole pdfparserConfig if you want tika 
to OCR
   context.set(PDFParserConfig.class,pdfConfig);
   context.set(Parser.class,parser);
 }
 InputStream input = new FileInputStream(f);
 try {
   parser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
   e.printStackTrace();
   return false;
  }
 SolrInputDocument up = new SolrInputDocument();
 if (title==null) title = metadata.get("title");
 if (author==null) author = metadata.get("author");
 up.addField("id",f.getCanonicalPath()); // load up whatever fields 
you are using
 up.addField("location",idString);
 up.addField("access",access);
 up.addField("datasource",datasource);
 up.addField("title",title);
 up.addField("author",author);
 if (year>0) up.addField("year",year);
 if (opfyear>0) up.addField("opfyear",opfyear);
 String content = textHandler.toString();
 up.addField("_text_",content);
 UpdateRequest req = new UpdateRequest();
 req.add(up);
 req.setBasicAuthCredentials("solrAdmin", password);
 UpdateResponse ur =  req.process(solr,"prindex");
 req.commit(solr, "prindex");
 return true;

-Original Message-
From: Erick Erickson 
Sent: Wednesday, 31 October 2018 06:00
To: solr-user 
Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA

All of the above work, but for robust production situations you'll want to 
consider a SolrJ client, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines 
indexing from a DB and using Tika, but those are independent.

Best,
Erick
On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau  wrote:
>
> Hi there,
>
> Here are a couple of ways I'm aware of:
>
> 1. Extract-handler / post tool
> You can use the curl command with the extract handler or bin/post to
> upload a single document.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell
> -using-apache-tika.html
>
> 2. DataImportHandler
> This could be used for, say, uploading multiple documents with Tika.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-sto
> re-data-with-the-data-import-handler.html#the-tikaentityprocessor
>
> You should also be able to do it via the admin page, so long as you
> define and modify the extract handler in solrconfig.xml.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-up
> load
>
> Hope this helps!
>
> On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin
> 
> wrote:
>
> > Hello there, let me introduce my self. My name is Mohammad Kevin
> > Putra (you can call me Kevin), from Indonesia, i am a beginner in
> > backend developer, i use Linux Mint, i use Apache SOLR 7.5.0 and Apache 
> > TIKA 1.91.0.
> >
> > I have a little bit problem about how to put PDF File via Apache
> > TIKA. I understand how SOLR or TIKA works, but i don't know how they
> > both integrated.
> > Last thing i know, TIKA can extract the PDF file i upload, and parse
> > it into data/meta data automatically. And i just have to copy &
> > paste it to the "Documents" tab in core solr.
> > The question is :
> > 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it
> > only with CLI mode ? if yes only with CLI mode, can you explain it
> > to me please ?
> > 2. Is it possible to add a text result in "Query" tab ?.
> >
> > The Background i asking about this is, i want to indexing PDF in my
> > local system, then i just upload it like "drag 

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread ☼ R Nair
I have done a production implementation of this, running for last four
months without any issue. Just a resatrt every week of all components.

http://blog.cloudera.com/blog/2015/10/how-to-index-scanned-pdfs-at-scale-using-fewer-than-50-lines-of-code/


Best, Ravion

On Tue, Oct 30, 2018, 1:00 PM Erick Erickson 
wrote:

> All of the above work, but for robust production situations you'll
> want to consider a SolrJ client, see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog
> combines indexing from a DB and using Tika, but those are independent.
>
> Best,
> Erick
> On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau 
> wrote:
> >
> > Hi there,
> >
> > Here are a couple of ways I'm aware of:
> >
> > 1. Extract-handler / post tool
> > You can use the curl command with the extract handler or bin/post to
> upload
> > a single document.
> > Reference:
> >
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > 2. DataImportHandler
> > This could be used for, say, uploading multiple documents with Tika.
> > Reference:
> >
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
> >
> > You should also be able to do it via the admin page, so long as you
> define
> > and modify the extract handler in solrconfig.xml.
> > Reference:
> >
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload
> >
> > Hope this helps!
> >
> > On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin 
> > wrote:
> >
> > > Hello there, let me introduce my self. My name is Mohammad Kevin Putra
> (you
> > > can call me Kevin), from Indonesia, i am a beginner in backend
> developer, i
> > > use Linux Mint, i use Apache SOLR 7.5.0 and Apache TIKA 1.91.0.
> > >
> > > I have a little bit problem about how to put PDF File via Apache TIKA.
> I
> > > understand how SOLR or TIKA works, but i don't know how they both
> > > integrated.
> > > Last thing i know, TIKA can extract the PDF file i upload, and parse it
> > > into data/meta data automatically. And i just have to copy & paste it
> to
> > > the "Documents" tab in core solr.
> > > The question is :
> > > 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only
> > > with CLI mode ? if yes only with CLI mode, can you explain it to me
> please
> > > ?
> > > 2. Is it possible to add a text result in "Query" tab ?.
> > >
> > > The Background i asking about this is, i want to indexing PDF in my
> local
> > > system, then i just upload it like "drag & drop" in SOLR (is it
> possible ?)
> > > then when i type something in search box the result is like this :
> > > (Title of doc)
> > > blablablabla (yellow stabilo result) blablabla.
> > > the blablabla text is like a couple sentences. That's all i need.
> > > Sorry for my bad english.
> > > Thanks for reading and replying this for me, it will be very helpful
> to me.
> > > Thanks a lot
> > >
>


Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Erick Erickson
All of the above work, but for robust production situations you'll
want to consider a SolrJ client, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog
combines indexing from a DB and using Tika, but those are independent.

Best,
Erick
On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau  wrote:
>
> Hi there,
>
> Here are a couple of ways I'm aware of:
>
> 1. Extract-handler / post tool
> You can use the curl command with the extract handler or bin/post to upload
> a single document.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html
>
> 2. DataImportHandler
> This could be used for, say, uploading multiple documents with Tika.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor
>
> You should also be able to do it via the admin page, so long as you define
> and modify the extract handler in solrconfig.xml.
> Reference:
> https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload
>
> Hope this helps!
>
> On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin 
> wrote:
>
> > Hello there, let me introduce my self. My name is Mohammad Kevin Putra (you
> > can call me Kevin), from Indonesia, i am a beginner in backend developer, i
> > use Linux Mint, i use Apache SOLR 7.5.0 and Apache TIKA 1.91.0.
> >
> > I have a little bit problem about how to put PDF File via Apache TIKA. I
> > understand how SOLR or TIKA works, but i don't know how they both
> > integrated.
> > Last thing i know, TIKA can extract the PDF file i upload, and parse it
> > into data/meta data automatically. And i just have to copy & paste it to
> > the "Documents" tab in core solr.
> > The question is :
> > 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only
> > with CLI mode ? if yes only with CLI mode, can you explain it to me please
> > ?
> > 2. Is it possible to add a text result in "Query" tab ?.
> >
> > The Background i asking about this is, i want to indexing PDF in my local
> > system, then i just upload it like "drag & drop" in SOLR (is it possible ?)
> > then when i type something in search box the result is like this :
> > (Title of doc)
> > blablablabla (yellow stabilo result) blablabla.
> > the blablabla text is like a couple sentences. That's all i need.
> > Sorry for my bad english.
> > Thanks for reading and replying this for me, it will be very helpful to me.
> > Thanks a lot
> >


Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Kamuela Lau
Hi there,

Here are a couple of ways I'm aware of:

1. Extract-handler / post tool
You can use the curl command with the extract handler or bin/post to upload
a single document.
Reference:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html

2. DataImportHandler
This could be used for, say, uploading multiple documents with Tika.
Reference:
https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html#the-tikaentityprocessor

You should also be able to do it via the admin page, so long as you define
and modify the extract handler in solrconfig.xml.
Reference:
https://lucene.apache.org/solr/guide/7_5/documents-screen.html#file-upload

Hope this helps!

On Tue, Oct 30, 2018 at 3:40 PM adiyaksa kevin 
wrote:

> Hello there, let me introduce my self. My name is Mohammad Kevin Putra (you
> can call me Kevin), from Indonesia, i am a beginner in backend developer, i
> use Linux Mint, i use Apache SOLR 7.5.0 and Apache TIKA 1.91.0.
>
> I have a little bit problem about how to put PDF File via Apache TIKA. I
> understand how SOLR or TIKA works, but i don't know how they both
> integrated.
> Last thing i know, TIKA can extract the PDF file i upload, and parse it
> into data/meta data automatically. And i just have to copy & paste it to
> the "Documents" tab in core solr.
> The question is :
> 1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only
> with CLI mode ? if yes only with CLI mode, can you explain it to me please
> ?
> 2. Is it possible to add a text result in "Query" tab ?.
>
> The Background i asking about this is, i want to indexing PDF in my local
> system, then i just upload it like "drag & drop" in SOLR (is it possible ?)
> then when i type something in search box the result is like this :
> (Title of doc)
> blablablabla (yellow stabilo result) blablabla.
> the blablabla text is like a couple sentences. That's all i need.
> Sorry for my bad english.
> Thanks for reading and replying this for me, it will be very helpful to me.
> Thanks a lot
>


Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread adiyaksa kevin
Hello there, let me introduce my self. My name is Mohammad Kevin Putra (you
can call me Kevin), from Indonesia, i am a beginner in backend developer, i
use Linux Mint, i use Apache SOLR 7.5.0 and Apache TIKA 1.91.0.

I have a little bit problem about how to put PDF File via Apache TIKA. I
understand how SOLR or TIKA works, but i don't know how they both
integrated.
Last thing i know, TIKA can extract the PDF file i upload, and parse it
into data/meta data automatically. And i just have to copy & paste it to
the "Documents" tab in core solr.
The question is :
1. can i upload PDF File to SOLR via TIKA with GUI mode ? or is it only
with CLI mode ? if yes only with CLI mode, can you explain it to me please ?
2. Is it possible to add a text result in "Query" tab ?.

The Background i asking about this is, i want to indexing PDF in my local
system, then i just upload it like "drag & drop" in SOLR (is it possible ?)
then when i type something in search box the result is like this :
(Title of doc)
blablablabla (yellow stabilo result) blablabla.
the blablabla text is like a couple sentences. That's all i need.
Sorry for my bad english.
Thanks for reading and replying this for me, it will be very helpful to me.
Thanks a lot


RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
>http -  however, the big advantage of doing your indexing on different machine 
>is that the heavy lifting that tika does in extracting text from documents, 
>finding metadata etc is not happening on the server. If the indexer crashes, 
>it doesn’t affect Solr either.

+1 

for what can go wrong: 
http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
 

https://www.youtube.com/watch?v=vRPTPMwI53k=13s=43=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp

Really, we try our best on Tika, but sometimes bad things happen.  Let us know 
when they do, and we'll try to fix them.


RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Phil Scadden
http -  however, the big advantage of doing your indexing on different machine 
is that the heavy lifting that tika does in extracting text from documents, 
finding metadata etc is not happening on the server. If the indexer crashes, it 
doesn’t affect Solr either.

-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, 20 June 2017 11:29 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr) because 
Python is my main programming language. I have an impression that 1. they send 
HTTP requests to the server according to the server APIs; 2.
they are not official and thus possibly not up to date. Does SolrJ talk to the 
server via HTTP or some other more native ways? Is the main benefit of SolrJ 
over other clients the official shipment with Solr? Thank you.

Best regards,
Ziyuan

On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just
> trying to figure out what is going on by indexing one or two PDF files
> first. Thank you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson
> <erickerick...@gmail.com>
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of
>> /update/extract/
>> > from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to
>> _text_.
>> > Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher
>> > <erik.hatc...@gmail.com>
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too.
>> It’s
>> >> got schema and config and even UI for file indexing and searching.
>>  Check
>> >> it out README.txt under example/files in your Solr install.
>> >>
>> >> Erik
>> >>
>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>> >> >
>> >> > Hi Erick,
>> >> >
>> >> > thanks very much for the explanations! Clarification for question 2:
>> more
>> >> > specifically I cannot see the field content in the returned
>> >> > JSON,
>> with
>> >> the
>> >> > the same definitions as in the post
>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/
>> >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/
>> >> >
>> >> > :
>> >> >
>> >> > > stored="true"/>
>> >> > > indexed="true"
>> >> > stored="false"/>
>> >> > 
>> >> >
>> 

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
Yeah, Chris knows a thing or two about Tika.  :)

-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com] 
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

No intention of spamming but I also want to mention tika-python 
<https://github.com/chrismattmann/tika-python> in the toolchain.

Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr) 
> because Python is my main programming language. I have an impression 
> that 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ 
> talk to the server via HTTP or some other more native ways? Is the 
> main benefit of SolrJ over other clients the official shipment with Solr? 
> Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just 
>> trying to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
>> <erickerick...@gmail.com>
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a 
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of 
>>> problems there are when parsing all the different formats so I'd 
>>> _really_ follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all 
>>> these different formats, implemented by different vendors with 
>>> different versions that more or less follow a spec which really 
>>> isn't a spec in many cases just recommendations using packages that 
>>> may or may not be actively maintained. And by the way, we'll try to 
>>> handle that 1G document that someone sends us, but don't blame us if 
>>> we hit an OOM.". When Tika is run on the same box as Solr any 
>>> problems in that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU 
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for 
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of
>>> /update/extract/
>>> > from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content 
>>> > to
>>> _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
>>> > <erik.hatc...@gmail.com>
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too.
>>> It’s
>>> >> got schema and config and even UI for file indexing and searching.
>>>  Check
>>> >> it out README.txt under example/files in your Solr install.
>>> >>
>>> >> Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations! Clarification for 
>>> >> > question
>>> 2: more
>>> >> > specifically I cannot see the field content in the returned 
>>> >> > JSON,
>>> with
>>> >> the
>>> >> > the same definitions

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
No intention of spamming but I also want to mention tika-python
 in the toolchain.

Ziyuan

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan  wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr)
> because Python is my main programming language. I have an impression that
> 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ talk to
> the server via HTTP or some other more native ways? Is the main benefit of
> SolrJ over other clients the official shipment with Solr? Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan"  wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just trying
>> to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of problems
>>> there are when parsing all the different formats so I'd _really_
>>> follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all these
>>> different formats, implemented by different vendors with different
>>> versions that more or less follow a spec which really isn't a spec in
>>> many cases just recommendations using packages that may or may not be
>>> actively maintained. And by the way, we'll try to handle that 1G
>>> document that someone sends us, but don't blame us if we hit an
>>> OOM.". When Tika is run on the same box as Solr any problems in
>>> that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of
>>> /update/extract/
>>> > from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content to
>>> _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too.
>>> It’s
>>> >> got schema and config and even UI for file indexing and searching.
>>>  Check
>>> >> it out README.txt under example/files in your Solr install.
>>> >>
>>> >> Erik
>>> >>
>>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
>>> >> >
>>> >> > Hi Erick,
>>> >> >
>>> >> > thanks very much for the explanations! Clarification for question
>>> 2: more
>>> >> > specifically I cannot see the field content in the returned JSON,
>>> with
>>> >> the
>>> >> > the same definitions as in the post
>>> >> > >> >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> >> > :
>>> >> >
>>> >> > >> stored="true"/>
>>> >> > >> indexed="true"
>>> >> > stored="false"/>
>>> >> > 
>>> >> >
>>> >> > Is it so that Tika does not fill these two fields automatically and
>>> I
>>> >> have
>>> >> > to write some client code to fill them?
>>> >> >
>>> >> > Best regards,
>>> >> > Ziyuan
>>> >> >
>>> >> >
>>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <
>>> erickerick...@gmail.com
>>> >> >
>>> >> > wrote:
>>> >> >
>>> >> >> 1> Yes, you can use your single definition. The author identifies
>>> the
>>> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>>> >> >> copyField directive copying (perhaps) many different fields to the
>>> >> >> "text" field. That permits simple searches against a single field
>>> >> >> rather than, say, using edismax to search across multiple separate
>>> >> >> fields.
>>> >> >>
>>> >> >> 2> The link you referenced is for Data Import Handler, which is
>>> much
>>> >> >> different than just posting files to Solr. See
>>> >> >> ExtractingRequestHandler:
>>> >> >> https://cwiki.apache.org/confluence/display/solr/
>>> >> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>>> >> >> There are ways to map meta-data fields from the doc into specific
>>> >> >> fields matching your schema. Be a little careful 

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr)
because Python is my main programming language. I have an impression that
1. they send HTTP requests to the server according to the server APIs; 2.
they are not official and thus possibly not up to date. Does SolrJ talk to
the server via HTTP or some other more native ways? Is the main benefit of
SolrJ over other clients the official shipment with Solr? Thank you.

Best regards,
Ziyuan

On Jun 19, 2017 18:43, "ZiYuan"  wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just trying to
> figure out what is going on by indexing one or two PDF files first. Thank
> you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of
>> /update/extract/
>> > from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to
>> _text_.
>> > Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too.
>> It’s
>> >> got schema and config and even UI for file indexing and searching.
>>  Check
>> >> it out README.txt under example/files in your Solr install.
>> >>
>> >> Erik
>> >>
>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
>> >> >
>> >> > Hi Erick,
>> >> >
>> >> > thanks very much for the explanations! Clarification for question 2:
>> more
>> >> > specifically I cannot see the field content in the returned JSON,
>> with
>> >> the
>> >> > the same definitions as in the post
>> >> > > >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>> >> > :
>> >> >
>> >> > > stored="true"/>
>> >> > > indexed="true"
>> >> > stored="false"/>
>> >> > 
>> >> >
>> >> > Is it so that Tika does not fill these two fields automatically and I
>> >> have
>> >> > to write some client code to fill them?
>> >> >
>> >> > Best regards,
>> >> > Ziyuan
>> >> >
>> >> >
>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <
>> erickerick...@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> 1> Yes, you can use your single definition. The author identifies
>> the
>> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>> >> >> copyField directive copying (perhaps) many different fields to the
>> >> >> "text" field. That permits simple searches against a single field
>> >> >> rather than, say, using edismax to search across multiple separate
>> >> >> fields.
>> >> >>
>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>> >> >> different than just posting files to Solr. See
>> >> >> ExtractingRequestHandler:
>> >> >> https://cwiki.apache.org/confluence/display/solr/
>> >> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> >> >> There are ways to map meta-data fields from the doc into specific
>> >> >> fields matching your schema. Be a little careful here. There is no
>> >> >> standard across different types of docs as to what meta-data field
>> is
>> >> >> included. PDF might have a "last_edited" field. Word might have a
>> >> >> "last_modified" field where the two mean the same thing. Here's a
>> link
>> >> >> to a SolrJ program that'll dump all the fields:
>> >> >> 

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Dear Erick and Timothy,

yes I will parse from the client for all the benefits. I am just trying to
figure out what is going on by indexing one or two PDF files first. Thank
you both.

Best regards,
Ziyuan

On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson 
wrote:

> bq: Hope that there is no side effect of not mapping the PDF
>
> Well, yes it will have that side effect. You can cure that with a
> copyField directive from content to _text_.
>
> But do really consider running this as a SolrJ program on the client.
> Tim knows in far more painful detail than I do what kinds of problems
> there are when parsing all the different formats so I'd _really_
> follow his advice.
>
> Tika pretty much has an impossible job. "Here, try to parse all these
> different formats, implemented by different vendors with different
> versions that more or less follow a spec which really isn't a spec in
> many cases just recommendations using packages that may or may not be
> actively maintained. And by the way, we'll try to handle that 1G
> document that someone sends us, but don't blame us if we hit an
> OOM.". When Tika is run on the same box as Solr any problems in
> that entire chain can adversely affect your search.
>
> Not to mention that Tika has to do some heavy lifting, using CPU
> cycles that are unavailable for Solr.
>
> Extracting Request Handler is a fine way to get started, but for
> production seriously consider a separate client.
>
> Best,
> Erick
>
> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
> > Hi Erick,
> >
> > Now it is clear. I have to update the request handler of /update/extract/
> > from
> > "defaults":{"fmap.content":"_text_"}
> > to
> > "defaults":{"fmap.content":"content"}
> > to fill the field.
> >
> > Hope that there is no side effect of not mapping the PDF content to
> _text_.
> > Thank you for the hint.
> >
> > Best regards,
> > Ziyuan
> >
> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
> > wrote:
> >
> >> Ziyuan -
> >>
> >> You may be interested in the example/files that ships with Solr too.
> It’s
> >> got schema and config and even UI for file indexing and searching.
>  Check
> >> it out README.txt under example/files in your Solr install.
> >>
> >> Erik
> >>
> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
> >> >
> >> > Hi Erick,
> >> >
> >> > thanks very much for the explanations! Clarification for question 2:
> more
> >> > specifically I cannot see the field content in the returned JSON, with
> >> the
> >> > the same definitions as in the post
> >> >  >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >> > :
> >> >
> >> >  stored="true"/>
> >> >  indexed="true"
> >> > stored="false"/>
> >> > 
> >> >
> >> > Is it so that Tika does not fill these two fields automatically and I
> >> have
> >> > to write some client code to fill them?
> >> >
> >> > Best regards,
> >> > Ziyuan
> >> >
> >> >
> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <
> erickerick...@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> 1> Yes, you can use your single definition. The author identifies the
> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> >> copyField directive copying (perhaps) many different fields to the
> >> >> "text" field. That permits simple searches against a single field
> >> >> rather than, say, using edismax to search across multiple separate
> >> >> fields.
> >> >>
> >> >> 2> The link you referenced is for Data Import Handler, which is much
> >> >> different than just posting files to Solr. See
> >> >> ExtractingRequestHandler:
> >> >> https://cwiki.apache.org/confluence/display/solr/
> >> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> >> >> There are ways to map meta-data fields from the doc into specific
> >> >> fields matching your schema. Be a little careful here. There is no
> >> >> standard across different types of docs as to what meta-data field is
> >> >> included. PDF might have a "last_edited" field. Word might have a
> >> >> "last_modified" field where the two mean the same thing. Here's a
> link
> >> >> to a SolrJ program that'll dump all the fields:
> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can
> easily
> >> >> hack out the DB bits.
> >> >>
> >> >> BTW, once you get more familiar with processing, I strongly recommend
> >> >> you do the document processing on the client, the reasons are
> outlined
> >> >> in that article.
> >> >>
> >> >> bq: even I define the fields as he said I cannot see them in the
> >> >> search results as keys in JSON
> >> >> are the fields set as stored="true"? They must be to be returned in
> >> >> requests (skipping the docValues discussion here).
> >> >>
> >> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> >> Because it has stored=false, you can only search it, you cannot
> >> >> highlight 

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erick Erickson
bq: Hope that there is no side effect of not mapping the PDF

Well, yes it will have that side effect. You can cure that with a
copyField directive from content to _text_.

But do really consider running this as a SolrJ program on the client.
Tim knows in far more painful detail than I do what kinds of problems
there are when parsing all the different formats so I'd _really_
follow his advice.

Tika pretty much has an impossible job. "Here, try to parse all these
different formats, implemented by different vendors with different
versions that more or less follow a spec which really isn't a spec in
many cases just recommendations using packages that may or may not be
actively maintained. And by the way, we'll try to handle that 1G
document that someone sends us, but don't blame us if we hit an
OOM.". When Tika is run on the same box as Solr any problems in
that entire chain can adversely affect your search.

Not to mention that Tika has to do some heavy lifting, using CPU
cycles that are unavailable for Solr.

Extracting Request Handler is a fine way to get started, but for
production seriously consider a separate client.

Best,
Erick

On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan  wrote:
> Hi Erick,
>
> Now it is clear. I have to update the request handler of /update/extract/
> from
> "defaults":{"fmap.content":"_text_"}
> to
> "defaults":{"fmap.content":"content"}
> to fill the field.
>
> Hope that there is no side effect of not mapping the PDF content to _text_.
> Thank you for the hint.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
> wrote:
>
>> Ziyuan -
>>
>> You may be interested in the example/files that ships with Solr too.  It’s
>> got schema and config and even UI for file indexing and searching.   Check
>> it out README.txt under example/files in your Solr install.
>>
>> Erik
>>
>> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
>> >
>> > Hi Erick,
>> >
>> > thanks very much for the explanations! Clarification for question 2: more
>> > specifically I cannot see the field content in the returned JSON, with
>> the
>> > the same definitions as in the post
>> > > hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>> > :
>> >
>> > 
>> > > > stored="false"/>
>> > 
>> >
>> > Is it so that Tika does not fill these two fields automatically and I
>> have
>> > to write some client code to fill them?
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> >
>> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson > >
>> > wrote:
>> >
>> >> 1> Yes, you can use your single definition. The author identifies the
>> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>> >> copyField directive copying (perhaps) many different fields to the
>> >> "text" field. That permits simple searches against a single field
>> >> rather than, say, using edismax to search across multiple separate
>> >> fields.
>> >>
>> >> 2> The link you referenced is for Data Import Handler, which is much
>> >> different than just posting files to Solr. See
>> >> ExtractingRequestHandler:
>> >> https://cwiki.apache.org/confluence/display/solr/
>> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> >> There are ways to map meta-data fields from the doc into specific
>> >> fields matching your schema. Be a little careful here. There is no
>> >> standard across different types of docs as to what meta-data field is
>> >> included. PDF might have a "last_edited" field. Word might have a
>> >> "last_modified" field where the two mean the same thing. Here's a link
>> >> to a SolrJ program that'll dump all the fields:
>> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> >> hack out the DB bits.
>> >>
>> >> BTW, once you get more familiar with processing, I strongly recommend
>> >> you do the document processing on the client, the reasons are outlined
>> >> in that article.
>> >>
>> >> bq: even I define the fields as he said I cannot see them in the
>> >> search results as keys in JSON
>> >> are the fields set as stored="true"? They must be to be returned in
>> >> requests (skipping the docValues discussion here).
>> >>
>> >> 3> Yes, the text field is a concatenation of all the other ones.
>> >> Because it has stored=false, you can only search it, you cannot
>> >> highlight or view. Fields you highlight must have stored=true BTW.
>> >>
>> >> Whether or not you can highlight "Trevor Hastie" depends an a lot of
>> >> things, most particularly whether that text is ever actually in a
>> >> field in your index. Just because there's no guarantee that the name
>> >> of the file is indexed in a searchable/highlightable way.
>> >>
>> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
>> parsed
>> >> as
>> >> id:Trevor _text_:Hastie
>> >> _text_ is the default field, look for a "df" parameter in your request
>> >> 

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick,

Now it is clear. I have to update the request handler of /update/extract/
from
"defaults":{"fmap.content":"_text_"}
to
"defaults":{"fmap.content":"content"}
to fill the field.

Hope that there is no side effect of not mapping the PDF content to _text_.
Thank you for the hint.

Best regards,
Ziyuan

On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher 
wrote:

> Ziyuan -
>
> You may be interested in the example/files that ships with Solr too.  It’s
> got schema and config and even UI for file indexing and searching.   Check
> it out README.txt under example/files in your Solr install.
>
> Erik
>
> > On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
> >
> > Hi Erick,
> >
> > thanks very much for the explanations! Clarification for question 2: more
> > specifically I cannot see the field content in the returned JSON, with
> the
> > the same definitions as in the post
> >  hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> > :
> >
> > 
> >  > stored="false"/>
> > 
> >
> > Is it so that Tika does not fill these two fields automatically and I
> have
> > to write some client code to fill them?
> >
> > Best regards,
> > Ziyuan
> >
> >
> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson  >
> > wrote:
> >
> >> 1> Yes, you can use your single definition. The author identifies the
> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> copyField directive copying (perhaps) many different fields to the
> >> "text" field. That permits simple searches against a single field
> >> rather than, say, using edismax to search across multiple separate
> >> fields.
> >>
> >> 2> The link you referenced is for Data Import Handler, which is much
> >> different than just posting files to Solr. See
> >> ExtractingRequestHandler:
> >> https://cwiki.apache.org/confluence/display/solr/
> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> >> There are ways to map meta-data fields from the doc into specific
> >> fields matching your schema. Be a little careful here. There is no
> >> standard across different types of docs as to what meta-data field is
> >> included. PDF might have a "last_edited" field. Word might have a
> >> "last_modified" field where the two mean the same thing. Here's a link
> >> to a SolrJ program that'll dump all the fields:
> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> >> hack out the DB bits.
> >>
> >> BTW, once you get more familiar with processing, I strongly recommend
> >> you do the document processing on the client, the reasons are outlined
> >> in that article.
> >>
> >> bq: even I define the fields as he said I cannot see them in the
> >> search results as keys in JSON
> >> are the fields set as stored="true"? They must be to be returned in
> >> requests (skipping the docValues discussion here).
> >>
> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> Because it has stored=false, you can only search it, you cannot
> >> highlight or view. Fields you highlight must have stored=true BTW.
> >>
> >> Whether or not you can highlight "Trevor Hastie" depends an a lot of
> >> things, most particularly whether that text is ever actually in a
> >> field in your index. Just because there's no guarantee that the name
> >> of the file is indexed in a searchable/highlightable way.
> >>
> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
> parsed
> >> as
> >> id:Trevor _text_:Hastie
> >> _text_ is the default field, look for a "df" parameter in your request
> >> handler in solrconfig.xml (usually "/select" or "/query").
> >>
> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan  wrote:
> >>> Hi,
> >>>
> >>> I am new to Solr and I need to implement a full-text search of some PDF
> >>> files. The indexing part works out of the box by using bin/post. I can
> >> see
> >>> search results in the admin UI given some queries, though without the
> >>> matched texts and the context.
> >>>
> >>> Now I am reading this post
> >>>  >> hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >>> for the highlighting part. It is for an older version of Solr when
> >> managed
> >>> schema was not available. Before fully understand what it is doing I
> have
> >>> some questions:
> >>>
> >>> 1. He defined two fields:
> >>>
> >>>  >>> multiValued="false"/>
> >>>  >>> multiValued="true"/>
> >>>
> >>> But why are there two fields needed? Can I define a field
> >>>
> >>>  >>> multiValued="true"/>
> >>>
> >>> to capture the full text?
> >>>
> >>> 2. How are the fields filled? I don't see relevant information in
> >>> TikaEntityProcessor's documentation
> >>>  dataimporthandler-extras/org/
> >> apache/solr/handler/dataimport/TikaEntityProcessor.html#
> >> 

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Allison, Timothy B.
> There is no standard across different types of docs as to what meta-data 
> field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing.

On Tika, we _try_ to normalize fields according to various standards, the most 
predominant is Dublin core, so that "author" in one format and "creator" in 
another will both be mapped to "dc:creator".  That said:

1) there are plenty of areas where we could do a better job of normalizing.  
Please let us know how to improve!
2) no matter how well we normalize, there are some metadata items that are 
specific to various file formats...I strongly recommend running Tika against a 
representative batch of documents and deciding which fields you need for your 
application.

Finally, if there's a chance you want metadata from embedded 
documents/attachments, checkout the RecursiveParserWrapper.  Under legacy Tika, 
if you have a bunch of images in a zip file, you'd never get the lat/longs...or 
you'd never get "dc:creator" from an MSWord file sent as an attachment in an 
MSG file.

Finally, and I mean it this time, I heartily second Erik's point about SolrJ 
and the need to keep your file processing outside of Solr's JVM, VM and M!




-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Monday, June 19, 2017 6:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

Ziyuan -

You may be interested in the example/files that ships with Solr too.  It’s got 
schema and config and even UI for file indexing and searching.   Check it out 
README.txt under example/files in your Solr install.

Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
> 
> Hi Erick,
> 
> thanks very much for the explanations! Clarification for question 2: 
> more specifically I cannot see the field content in the returned JSON, 
> with the the same definitions as in the post 
> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-t
> ext-inside-documents-indexed-with-solr-plus-tika/>
> :
> 
>  stored="true"/>  indexed="true"
> stored="false"/>
> 
> 
> Is it so that Tika does not fill these two fields automatically and I 
> have to write some client code to fill them?
> 
> Best regards,
> Ziyuan
> 
> 
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson 
> <erickerick...@gmail.com>
> wrote:
> 
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a 
>> copyField directive copying (perhaps) many different fields to the 
>> "text" field. That permits simple searches against a single field 
>> rather than, say, using edismax to search across multiple separate 
>> fields.
>> 
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> There are ways to map meta-data fields from the doc into specific 
>> fields matching your schema. Be a little careful here. There is no 
>> standard across different types of docs as to what meta-data field is 
>> included. PDF might have a "last_edited" field. Word might have a 
>> "last_modified" field where the two mean the same thing. Here's a 
>> link to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can 
>> easily hack out the DB bits.
>> 
>> BTW, once you get more familiar with processing, I strongly recommend 
>> you do the document processing on the client, the reasons are 
>> outlined in that article.
>> 
>> bq: even I define the fields as he said I cannot see them in the 
>> search results as keys in JSON are the fields set as stored="true"? 
>> They must be to be returned in requests (skipping the docValues 
>> discussion here).
>> 
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot 
>> highlight or view. Fields you highlight must have stored=true BTW.
>> 
>> Whether or not you can highlight "Trevor Hastie" depends an a lot of 
>> things, most particularly whether that text is ever actually in a 
>> field in your index. Just because there's no guarantee that the name 
>> of the file is indexed in a searchable/highl

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erik Hatcher
Ziyuan -

You may be interested in the example/files that ships with Solr too.  It’s got 
schema and config and even UI for file indexing and searching.   Check it out 
README.txt under example/files in your Solr install.

Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan  wrote:
> 
> Hi Erick,
> 
> thanks very much for the explanations! Clarification for question 2: more
> specifically I cannot see the field content in the returned JSON, with the
> the same definitions as in the post
> 
> :
> 
> 
>  stored="false"/>
> 
> 
> Is it so that Tika does not fill these two fields automatically and I have
> to write some client code to fill them?
> 
> Best regards,
> Ziyuan
> 
> 
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson 
> wrote:
> 
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a
>> copyField directive copying (perhaps) many different fields to the
>> "text" field. That permits simple searches against a single field
>> rather than, say, using edismax to search across multiple separate
>> fields.
>> 
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> There are ways to map meta-data fields from the doc into specific
>> fields matching your schema. Be a little careful here. There is no
>> standard across different types of docs as to what meta-data field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing. Here's a link
>> to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> hack out the DB bits.
>> 
>> BTW, once you get more familiar with processing, I strongly recommend
>> you do the document processing on the client, the reasons are outlined
>> in that article.
>> 
>> bq: even I define the fields as he said I cannot see them in the
>> search results as keys in JSON
>> are the fields set as stored="true"? They must be to be returned in
>> requests (skipping the docValues discussion here).
>> 
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot
>> highlight or view. Fields you highlight must have stored=true BTW.
>> 
>> Whether or not you can highlight "Trevor Hastie" depends an a lot of
>> things, most particularly whether that text is ever actually in a
>> field in your index. Just because there's no guarantee that the name
>> of the file is indexed in a searchable/highlightable way.
>> 
>> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
>> as
>> id:Trevor _text_:Hastie
>> _text_ is the default field, look for a "df" parameter in your request
>> handler in solrconfig.xml (usually "/select" or "/query").
>> 
>> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan  wrote:
>>> Hi,
>>> 
>>> I am new to Solr and I need to implement a full-text search of some PDF
>>> files. The indexing part works out of the box by using bin/post. I can
>> see
>>> search results in the admin UI given some queries, though without the
>>> matched texts and the context.
>>> 
>>> Now I am reading this post
>>> > hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> for the highlighting part. It is for an older version of Solr when
>> managed
>>> schema was not available. Before fully understand what it is doing I have
>>> some questions:
>>> 
>>> 1. He defined two fields:
>>> 
>>> >> multiValued="false"/>
>>> >> multiValued="true"/>
>>> 
>>> But why are there two fields needed? Can I define a field
>>> 
>>> >> multiValued="true"/>
>>> 
>>> to capture the full text?
>>> 
>>> 2. How are the fields filled? I don't see relevant information in
>>> TikaEntityProcessor's documentation
>>> > apache/solr/handler/dataimport/TikaEntityProcessor.html#
>> fields.inherited.from.class.org.apache.solr.handler.
>> dataimport.EntityProcessorBase>.
>>> The current text extractor should already be Tika (I can see
>>> 
>>> "x_parsed_by":
>>> ["org.apache.tika.parser.DefaultParser","org.apache.
>> tika.parser.pdf.PDFParser"]
>>> 
>>> in the returned JSON of some query). But even I define the fields as he
>>> said I cannot see them in the search results as keys in JSON.
>>> 
>>> 3. The _text_ field seems a concatenation of other fields, does it
>> contain
>>> the full text? Though it does not seem to be accessible by default.
>>> 
>>> To be brief, using The Elements of Statistical 

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick,

thanks very much for the explanations! Clarification for question 2: more
specifically I cannot see the field content in the returned JSON, with the
the same definitions as in the post

:





Is it so that Tika does not fill these two fields automatically and I have
to write some client code to fill them?

Best regards,
Ziyuan


On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson 
wrote:

> 1> Yes, you can use your single definition. The author identifies the
> "text" field as a catch-all. Somewhere in the schema there'll be a
> copyField directive copying (perhaps) many different fields to the
> "text" field. That permits simple searches against a single field
> rather than, say, using edismax to search across multiple separate
> fields.
>
> 2> The link you referenced is for Data Import Handler, which is much
> different than just posting files to Solr. See
> ExtractingRequestHandler:
> https://cwiki.apache.org/confluence/display/solr/
> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> There are ways to map meta-data fields from the doc into specific
> fields matching your schema. Be a little careful here. There is no
> standard across different types of docs as to what meta-data field is
> included. PDF might have a "last_edited" field. Word might have a
> "last_modified" field where the two mean the same thing. Here's a link
> to a SolrJ program that'll dump all the fields:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> hack out the DB bits.
>
> BTW, once you get more familiar with processing, I strongly recommend
> you do the document processing on the client, the reasons are outlined
> in that article.
>
> bq: even I define the fields as he said I cannot see them in the
> search results as keys in JSON
> are the fields set as stored="true"? They must be to be returned in
> requests (skipping the docValues discussion here).
>
> 3> Yes, the text field is a concatenation of all the other ones.
> Because it has stored=false, you can only search it, you cannot
> highlight or view. Fields you highlight must have stored=true BTW.
>
> Whether or not you can highlight "Trevor Hastie" depends an a lot of
> things, most particularly whether that text is ever actually in a
> field in your index. Just because there's no guarantee that the name
> of the file is indexed in a searchable/highlightable way.
>
> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
> as
> id:Trevor _text_:Hastie
> _text_ is the default field, look for a "df" parameter in your request
> handler in solrconfig.xml (usually "/select" or "/query").
>
> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan  wrote:
> > Hi,
> >
> > I am new to Solr and I need to implement a full-text search of some PDF
> > files. The indexing part works out of the box by using bin/post. I can
> see
> > search results in the admin UI given some queries, though without the
> > matched texts and the context.
> >
> > Now I am reading this post
> >  hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> > for the highlighting part. It is for an older version of Solr when
> managed
> > schema was not available. Before fully understand what it is doing I have
> > some questions:
> >
> > 1. He defined two fields:
> >
> >  > multiValued="false"/>
> >  > multiValued="true"/>
> >
> > But why are there two fields needed? Can I define a field
> >
> >  > multiValued="true"/>
> >
> > to capture the full text?
> >
> > 2. How are the fields filled? I don't see relevant information in
> > TikaEntityProcessor's documentation
> >  apache/solr/handler/dataimport/TikaEntityProcessor.html#
> fields.inherited.from.class.org.apache.solr.handler.
> dataimport.EntityProcessorBase>.
> > The current text extractor should already be Tika (I can see
> >
> > "x_parsed_by":
> > ["org.apache.tika.parser.DefaultParser","org.apache.
> tika.parser.pdf.PDFParser"]
> >
> > in the returned JSON of some query). But even I define the fields as he
> > said I cannot see them in the search results as keys in JSON.
> >
> > 3. The _text_ field seems a concatenation of other fields, does it
> contain
> > the full text? Though it does not seem to be accessible by default.
> >
> > To be brief, using The Elements of Statistical Learning
> >  ESLII_print10.pdf>
> > as an example, how to highlight the relevant texts for the query "SVM"?
> And
> > if changing the file name into "The Elements of Statistical Learning -
> > Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
> > query "id:Trevor Hastie"?
> >
> > Thank you.
> >
> > Best regards,
> > Ziyuan
>


Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-18 Thread Erick Erickson
1> Yes, you can use your single definition. The author identifies the
"text" field as a catch-all. Somewhere in the schema there'll be a
copyField directive copying (perhaps) many different fields to the
"text" field. That permits simple searches against a single field
rather than, say, using edismax to search across multiple separate
fields.

2> The link you referenced is for Data Import Handler, which is much
different than just posting files to Solr. See
ExtractingRequestHandler:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
There are ways to map meta-data fields from the doc into specific
fields matching your schema. Be a little careful here. There is no
standard across different types of docs as to what meta-data field is
included. PDF might have a "last_edited" field. Word might have a
"last_modified" field where the two mean the same thing. Here's a link
to a SolrJ program that'll dump all the fields:
https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
hack out the DB bits.

BTW, once you get more familiar with processing, I strongly recommend
you do the document processing on the client, the reasons are outlined
in that article.

bq: even I define the fields as he said I cannot see them in the
search results as keys in JSON
are the fields set as stored="true"? They must be to be returned in
requests (skipping the docValues discussion here).

3> Yes, the text field is a concatenation of all the other ones.
Because it has stored=false, you can only search it, you cannot
highlight or view. Fields you highlight must have stored=true BTW.

Whether or not you can highlight "Trevor Hastie" depends an a lot of
things, most particularly whether that text is ever actually in a
field in your index. Just because there's no guarantee that the name
of the file is indexed in a searchable/highlightable way.

And the query q=id:Trevor Hastie won't do what you think. It'll be parsed as
id:Trevor _text_:Hastie
_text_ is the default field, look for a "df" parameter in your request
handler in solrconfig.xml (usually "/select" or "/query").

On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan  wrote:
> Hi,
>
> I am new to Solr and I need to implement a full-text search of some PDF
> files. The indexing part works out of the box by using bin/post. I can see
> search results in the admin UI given some queries, though without the
> matched texts and the context.
>
> Now I am reading this post
> 
> for the highlighting part. It is for an older version of Solr when managed
> schema was not available. Before fully understand what it is doing I have
> some questions:
>
> 1. He defined two fields:
>
>  multiValued="false"/>
>  multiValued="true"/>
>
> But why are there two fields needed? Can I define a field
>
>  multiValued="true"/>
>
> to capture the full text?
>
> 2. How are the fields filled? I don't see relevant information in
> TikaEntityProcessor's documentation
> .
> The current text extractor should already be Tika (I can see
>
> "x_parsed_by":
> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>
> in the returned JSON of some query). But even I define the fields as he
> said I cannot see them in the search results as keys in JSON.
>
> 3. The _text_ field seems a concatenation of other fields, does it contain
> the full text? Though it does not seem to be accessible by default.
>
> To be brief, using The Elements of Statistical Learning
> 
> as an example, how to highlight the relevant texts for the query "SVM"? And
> if changing the file name into "The Elements of Statistical Learning -
> Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
> query "id:Trevor Hastie"?
>
> Thank you.
>
> Best regards,
> Ziyuan


Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-17 Thread ZiYuan
Hi,

I am new to Solr and I need to implement a full-text search of some PDF
files. The indexing part works out of the box by using bin/post. I can see
search results in the admin UI given some queries, though without the
matched texts and the context.

Now I am reading this post

for the highlighting part. It is for an older version of Solr when managed
schema was not available. Before fully understand what it is doing I have
some questions:

1. He defined two fields:




But why are there two fields needed? Can I define a field



to capture the full text?

2. How are the fields filled? I don't see relevant information in
TikaEntityProcessor's documentation
.
The current text extractor should already be Tika (I can see

"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]

in the returned JSON of some query). But even I define the fields as he
said I cannot see them in the search results as keys in JSON.

3. The _text_ field seems a concatenation of other fields, does it contain
the full text? Though it does not seem to be accessible by default.

To be brief, using The Elements of Statistical Learning

as an example, how to highlight the relevant texts for the query "SVM"? And
if changing the file name into "The Elements of Statistical Learning -
Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
query "id:Trevor Hastie"?

Thank you.

Best regards,
Ziyuan


Re: indexing pdf files using post tool

2016-03-19 Thread Francisco Andrés Fernández
Vidya, I don't know if I'm understanding it very well but, I think that the
best way is to parse your text using a routine outside Solr. You might need
to map the different parts of your document using your domain knowledge and
use such routine to produce an XML document for example, with corresponding
tags for any part you need to differentiate. After that you could index it
in Solr.
Francisco

El mié., 16 de mar. de 2016 a la(s) 04:18, vidya <vidya.nade...@tcs.com>
escribió:

> Sorry for conveying it in wrong way. I want my data of 1 pdf file to be
> indexed with different fields in a document of solr according to data in it
> like name;id;title;content etc
>
> Thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: indexing pdf files using post tool

2016-03-19 Thread Binoy Dalal
Take a look at the CloneFieldUpdateProcessorFactory here:
http://www.solr-start.com/info/update-request-processors/

On Wed, 16 Mar 2016, 18:25 Binoy Dalal, <binoydala...@gmail.com> wrote:

> Like Francisco said, use a custom update processor to map the fields the
> way you want and add it to your update chain.
>
> On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, <fra...@gmail.com>
> wrote:
>
>> Vidya, I don't know if I'm understanding it very well but, I think that
>> the
>> best way is to parse your text using a routine outside Solr. You might
>> need
>> to map the different parts of your document using your domain knowledge
>> and
>> use such routine to produce an XML document for example, with
>> corresponding
>> tags for any part you need to differentiate. After that you could index it
>> in Solr.
>> Francisco
>>
>> El mié., 16 de mar. de 2016 a la(s) 04:18, vidya <vidya.nade...@tcs.com>
>> escribió:
>>
>> > Sorry for conveying it in wrong way. I want my data of 1 pdf file to be
>> > indexed with different fields in a document of solr according to data
>> in it
>> > like name;id;title;content etc
>> >
>> > Thanks
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> >
>> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>>
> --
> Regards,
> Binoy Dalal
>
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-18 Thread Binoy Dalal
Like Francisco said, use a custom update processor to map the fields the
way you want and add it to your update chain.

On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, <fra...@gmail.com>
wrote:

> Vidya, I don't know if I'm understanding it very well but, I think that the
> best way is to parse your text using a routine outside Solr. You might need
> to map the different parts of your document using your domain knowledge and
> use such routine to produce an XML document for example, with corresponding
> tags for any part you need to differentiate. After that you could index it
> in Solr.
> Francisco
>
> El mié., 16 de mar. de 2016 a la(s) 04:18, vidya <vidya.nade...@tcs.com>
> escribió:
>
> > Sorry for conveying it in wrong way. I want my data of 1 pdf file to be
> > indexed with different fields in a document of solr according to data in
> it
> > like name;id;title;content etc
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-18 Thread Jan Høydahl
Hi

You can look at the Apache Tika project or the PDFBox project to parse your 
files before sending to Solr.
Alternatively, if your processing is very simple, you can use the built-in Tika 
as U just did, and
then deploy some UpdateRequestProcessor’s in order to modify the Tika output 
into whatever fields you like.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. mar. 2016 kl. 08.18 skrev vidya <vidya.nade...@tcs.com>:
> 
> Sorry for conveying it in wrong way. I want my data of 1 pdf file to be
> indexed with different fields in a document of solr according to data in it
> like name;id;title;content etc
> 
> Thanks 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing pdf files using post tool

2016-03-16 Thread vidya
Sorry for conveying it in wrong way. I want my data of 1 pdf file to be
indexed with different fields in a document of solr according to data in it
like name;id;title;content etc

Thanks 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4264052.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing pdf files using post tool

2016-03-15 Thread roshan agarwal
Yes vidya, you just have to use copy field

Roshan

On Tue, Mar 15, 2016 at 3:07 PM, vidya <vidya.nade...@tcs.com> wrote:

> Hi
> I got data into my content field. But i wanted to have differnt fields to
> be
> allocated for data in my file.How can I achieve this ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Roshan Agarwal
Managing Director
Siddhast IP Innovation (P) Ltd
Phone: +91 11-65246257
M:+91 9871549769
email: ros...@siddhast.com
-
About SIDDHAST(www.siddhast.com)
SIDDHAST is a research and analytical company, which provide service in the
following area-Intellectual Property, Market Research, Business
Research,Technology Transfer. The company is Incorporated in March 2007,
and has completed more than 100 assignments.
URL: www.siddhast.com

--
This message (including attachments, if any) is confidential and may be
privileged. Before opening the attachments please check them for viruses
and defects. M/s Siddhast Intellectual Property Innovations Pvt Ltd will
not be responsible for any viruses or defects or any forwarded attachments
emanating either from within SIDDHAST or outside.


Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
You should use copy fields.
https://cwiki.apache.org/confluence/display/solr/Copying+Fields

On Tue, 15 Mar 2016, 15:07 vidya, <vidya.nade...@tcs.com> wrote:

> Hi
> I got data into my content field. But i wanted to have differnt fields to
> be
> allocated for data in my file.How can I achieve this ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
-- 
Regards,
Binoy Dalal


Re: indexing pdf files using post tool

2016-03-15 Thread vidya
Hi
I got data into my content field. But i wanted to have differnt fields to be
allocated for data in my file.How can I achieve this ?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811p4263840.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
Do you have a "content" field defined in your schema? Is it stored?

By default, the content from the docs uploaded through post should be
mapped to a field called "content".

On Tue, 15 Mar 2016, 12:47 vidya, <vidya.nade...@tcs.com> wrote:

> Hi
> I am trying to index a pdf file by using post tool in my linux system,When
> i
> give the command
> bin/post -c core2 -p 8984 /root/solr/My_CV.pdf
> it is showing the search results like
> "response": {
> "numFound": 1,
> "start": 0,
> "docs": [
>   {
> "id": "/root/solr-5.5.0/My_CV.pdf",
> "meta_creation_date": [
>   "2016-03-15T06:22:17Z"
> ],
> "pdf_pdfversion": [
>   1.4
> ],
> "dcterms_created": [
>   "2016-03-15T06:22:17Z"
> ],
> "x_parsed_by": [
>   "org.apache.tika.parser.DefaultParser",
>   "org.apache.tika.parser.pdf.PDFParser"
> ],
> "xmptpg_npages": [
>   1
> ],
> "creation_date": [
>   "2016-03-15T06:22:17Z"
> ],
> "pdf_encrypted": [
>   false
> ],
> "title": [
>   "My CV"
> ],
> "stream_content_type": [
>   "application/pdf"
> ],
> "created": [
>   "Tue Mar 15 06:22:17 UTC 2016"
> ],
> "stream_size": [
>   18289
> ],
> "dc_format": [
>   "application/pdf; version=1.4"
> ],
> "producer": [
>   "wkhtmltopdf"
> ],
> "content_type": [
>   "application/pdf"
> ],
> "xmp_creatortool": [
>   "þÿ"
> ],
> "resourcename": [
>   "/root/solr/My_CV.pdf"
> ],
> "dc_title": [
>   "My CV"
> ],
> "_version_": 1528851429701189600
>   }
>
>
> but not the actual content in pdf file.
> How to index that dat.
> Please help me on this.
> Can post tool be used for indexing data from HDFS ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
-- 
Regards,
Binoy Dalal


indexing pdf files using post tool

2016-03-15 Thread vidya
Hi
I am trying to index a pdf file by using post tool in my linux system,When i
give the command
bin/post -c core2 -p 8984 /root/solr/My_CV.pdf
it is showing the search results like 
"response": {
"numFound": 1,
"start": 0,
"docs": [
  {
"id": "/root/solr-5.5.0/My_CV.pdf",
"meta_creation_date": [
  "2016-03-15T06:22:17Z"
],
"pdf_pdfversion": [
  1.4
],
"dcterms_created": [
  "2016-03-15T06:22:17Z"
],
"x_parsed_by": [
  "org.apache.tika.parser.DefaultParser",
  "org.apache.tika.parser.pdf.PDFParser"
],
"xmptpg_npages": [
  1
],
"creation_date": [
  "2016-03-15T06:22:17Z"
],
"pdf_encrypted": [
  false
],
"title": [
  "My CV"
],
"stream_content_type": [
  "application/pdf"
],
"created": [
  "Tue Mar 15 06:22:17 UTC 2016"
],
"stream_size": [
  18289
],
"dc_format": [
  "application/pdf; version=1.4"
],
"producer": [
  "wkhtmltopdf"
],
"content_type": [
  "application/pdf"
],
"xmp_creatortool": [
  "þÿ"
],
"resourcename": [
  "/root/solr/My_CV.pdf"
],
"dc_title": [
  "My CV"
],
"_version_": 1528851429701189600
  }


but not the actual content in pdf file.
How to index that dat.
Please help me on this.
Can post tool be used for indexing data from HDFS ?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-pdf-files-using-post-tool-tp4263811.html
Sent from the Solr - User mailing list archive at Nabble.com.


indexing pdf binary stored in mongodb?

2016-02-05 Thread Arnett, Gabriel
Anyone have any experience indexing pdfs stored in binary form in mongodb?

.
Gabe Arnett
Senior Director
Moody's Analytics

-

The information contained in this e-mail message, and any attachment thereto, 
is confidential and may not be disclosed without our express permission. If you 
are not the intended recipient or an employee or agent responsible for 
delivering this message to the intended recipient, you are hereby notified that 
you have received this message in error and that any review, dissemination, 
distribution or copying of this message, or any attachment thereto, in whole or 
in part, is strictly prohibited. If you have received this message in error, 
please immediately notify us by telephone, fax or e-mail and delete the message 
and all of its attachments. Thank you. Every effort is made to keep our network 
free from viruses. You should, however, review this e-mail message, as well as 
any attachment thereto, for viruses. We take no responsibility and have no 
liability for any computer virus which may be transferred via this e-mail 
message.


Re: indexing pdf binary stored in mongodb?

2016-02-05 Thread Jack Krupansky
See if they are stored in BSON format using GridFS. If so, you can simply
use the mongofiles command to retrieve the PDF into a local file and index
that in Solr either using Solr Cell or Tika.

See:
http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb
https://docs.mongodb.org/manual/reference/program/mongofiles/


-- Jack Krupansky

On Fri, Feb 5, 2016 at 3:13 PM, Arnett, Gabriel 
wrote:

> Anyone have any experience indexing pdfs stored in binary form in mongodb?
>
> .
> Gabe Arnett
> Senior Director
> Moody's Analytics
>
> -
>
> The information contained in this e-mail message, and any attachment
> thereto, is confidential and may not be disclosed without our express
> permission. If you are not the intended recipient or an employee or agent
> responsible for delivering this message to the intended recipient, you are
> hereby notified that you have received this message in error and that any
> review, dissemination, distribution or copying of this message, or any
> attachment thereto, in whole or in part, is strictly prohibited. If you
> have received this message in error, please immediately notify us by
> telephone, fax or e-mail and delete the message and all of its attachments.
> Thank you. Every effort is made to keep our network free from viruses. You
> should, however, review this e-mail message, as well as any attachment
> thereto, for viruses. We take no responsibility and have no liability for
> any computer virus which may be transferred via this e-mail message.
>


Re: solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Binoy Dalal
Do you see any exceptions in the solr log?

On Sat, 23 Jan 2016, 16:29 Moncif Aidi  wrote:

> HI,
>
> I have a problem with integrating solr in Ubuntu server.Before using solr
> on ubuntu server i tested it on my mac it was working perfectly. it indexed
> my PDF,Doc,Docx documents.so after installing solr on ubuntu server and
> using the same configuration files and librairies. i've found out that solr
> doesn't index PDf documents.But i can search over .Doc and .Docx documents.
> here some parts of my solrconfig.xml contents :
>
>  regex=".*\.jar" />
>regex="solr-cell-\d.*\.jar" />
>
>startup="lazy"
>   class="solr.extraction.ExtractingRequestHandler" >
> 
>   true
>   ignored_
>   _text_
> 
>   
>
>
> --
> M:+212 658541045
> Linkedin
> <
> https://www.linkedin.com/profile/view?id=131220035=nav_responsive_tab_profile
> >
>
> <
> https://www.linkedin.com/profile/view?id=131220035=nav_responsive_tab_profile
> >
> |  Facebook
>  |  *Skype :* moncif44
>
-- 
Regards,
Binoy Dalal


solr Indexing PDF attachments not working. in ubuntu

2016-01-23 Thread Moncif Aidi
HI,

I have a problem with integrating solr in Ubuntu server.Before using solr
on ubuntu server i tested it on my mac it was working perfectly. it indexed
my PDF,Doc,Docx documents.so after installing solr on ubuntu server and
using the same configuration files and librairies. i've found out that solr
doesn't index PDf documents.But i can search over .Doc and .Docx documents.
here some parts of my solrconfig.xml contents :


  



  true
  ignored_
  _text_

  


-- 
M:+212 658541045
Linkedin



|  Facebook
 |  *Skype :* moncif44


Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo
Thanks for all your replies.

I did chance upon this question from stackoverflow which it says is able to
solve the issues:
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/

However, when I tried to run it, it still get the same "?" output in
the content, the same as what I get from the Tika app.

Regards,
Edwin


On 17 December 2015 at 23:58, Walter Underwood 
wrote:

> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
> the next letter over farther. Letters might not be in reading order — two
> column text could be printed as horizontal scans. Custom fonts might not
> use an encoding that matches Unicode, which makes them encrypted (badly).
> And so on.
>
> As one of my coworkers said, trying to turn a PDF into structured text is
> like trying to turn hamburger back into a cow.
>
> PDF is where text goes to die.
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
> >
> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> Hi Alexandre,
> >>
> >> Thanks for your reply.
> >>
> >> So the only way to solve this issue is to explore with PDF specific
> tools
> >> and change the encoding of the file?
> >> Is there any way to configure it in Solr?
> >
> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> created in a way that Tika cannot easily extract the text, there's nothing
> you can do in Solr that will help.
> >
> > Unfortunately PDF isn't a content format but a presentation format - so
> extracting plain text is fraught with difficulty. You may see a character
> on a PDF page, but exactly how that character is generated (using a
> specific encoding, font, or even by drawing a picture) is outside your
> control. There are various businesses built on this premise - they charge
> for creating clean extracted text from PDFs - and even they have trouble
> with some PDFs.
> >
> > HTH
> >
> > Charlie
> >
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch  >
> >> wrote:
> >>
> >>> They could be using custom fonts and non-Unicode characters. That's
> >>> probably something to explore with PDF specific tools.
> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
> >>> wrote:
> >>>
>  I've checked all the files which has problem with the content in the
> Solr
>  index using the Tika app. All of them shows the same issues as what I
> see
>  in the Solr index.
> 
>  So does the issues lies with the encoding of the file? Are we able to
> >>> check
>  the encoding of the file?
> 
> 
>  Regards,
>  Edwin
> 
> 
>  On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
>  wrote:
> 
> > Hi Erik,
> >
> > I've shared the file on dropbox, which you can access via the link
> >>> here:
> >
> >>>
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >
> > This is what I get from the Tika app after dropping the file in.
> >
> > Content-Length: 75092
> > Content-Type: application/pdf
> > Type: COSName{Info}
> > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > X-TIKA:digest:SHA256:
> > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > access_permission:assemble_document: true
> > access_permission:can_modify: true
> > access_permission:can_print: true
> > access_permission:can_print_degraded: true
> > access_permission:extract_content: true
> > access_permission:extract_for_accessibility: true
> > access_permission:fill_in_form: true
> > access_permission:modify_annotations: true
> > dc:format: application/pdf; version=1.3
> > pdf:PDFVersion: 1.3
> > pdf:encrypted: false
> > producer: null
> > resourceName: Desmophen+670+BAe.pdf
> > xmpTPg:NPages: 3
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:15, Erik Hatcher 
>  wrote:
> >
> >> Edwin - Can you share one of those PDF files?
> >>
> >> Also, drop the file into the Tika app and see what it sees directly
> -
>  get
> >> the tika-app JAR and run that desktop application.
> >>
> >> Could be an encoding issue?
> >>
> >> Erik
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com 
> >>
> >>
> >>
> >>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
>  edwinye...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm using Solr 5.3.0
> >>>
> >>> I'm indexing some PDF documents. However, for certain PDF files,
> >>> there
> >> are
> 

Re: Issues when indexing PDF files

2015-12-18 Thread Erick Erickson
This could also simply be your browser isn't set up to
display UTF-8, the characters may be just fine.

Best,
Erick

On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
 wrote:
> Thanks for all your replies.
>
> I did chance upon this question from stackoverflow which it says is able to
> solve the issues:
> http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
>
> However, when I tried to run it, it still get the same "?" output in
> the content, the same as what I get from the Tika app.
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 23:58, Walter Underwood 
> wrote:
>
>> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
>> the next letter over farther. Letters might not be in reading order — two
>> column text could be printed as horizontal scans. Custom fonts might not
>> use an encoding that matches Unicode, which makes them encrypted (badly).
>> And so on.
>>
>> As one of my coworkers said, trying to turn a PDF into structured text is
>> like trying to turn hamburger back into a cow.
>>
>> PDF is where text goes to die.
>>
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
>> >
>> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> >> Hi Alexandre,
>> >>
>> >> Thanks for your reply.
>> >>
>> >> So the only way to solve this issue is to explore with PDF specific
>> tools
>> >> and change the encoding of the file?
>> >> Is there any way to configure it in Solr?
>> >
>> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
>> created in a way that Tika cannot easily extract the text, there's nothing
>> you can do in Solr that will help.
>> >
>> > Unfortunately PDF isn't a content format but a presentation format - so
>> extracting plain text is fraught with difficulty. You may see a character
>> on a PDF page, but exactly how that character is generated (using a
>> specific encoding, font, or even by drawing a picture) is outside your
>> control. There are various businesses built on this premise - they charge
>> for creating clean extracted text from PDFs - and even they have trouble
>> with some PDFs.
>> >
>> > HTH
>> >
>> > Charlie
>> >
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch > >
>> >> wrote:
>> >>
>> >>> They could be using custom fonts and non-Unicode characters. That's
>> >>> probably something to explore with PDF specific tools.
>> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>> >>> wrote:
>> >>>
>>  I've checked all the files which has problem with the content in the
>> Solr
>>  index using the Tika app. All of them shows the same issues as what I
>> see
>>  in the Solr index.
>> 
>>  So does the issues lies with the encoding of the file? Are we able to
>> >>> check
>>  the encoding of the file?
>> 
>> 
>>  Regards,
>>  Edwin
>> 
>> 
>>  On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com>
>>  wrote:
>> 
>> > Hi Erik,
>> >
>> > I've shared the file on dropbox, which you can access via the link
>> >>> here:
>> >
>> >>>
>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>> >
>> > This is what I get from the Tika app after dropping the file in.
>> >
>> > Content-Length: 75092
>> > Content-Type: application/pdf
>> > Type: COSName{Info}
>> > X-Parsed-By: org.apache.tika.parser.DefaultParser
>> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>> > X-TIKA:digest:SHA256:
>> > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>> > access_permission:assemble_document: true
>> > access_permission:can_modify: true
>> > access_permission:can_print: true
>> > access_permission:can_print_degraded: true
>> > access_permission:extract_content: true
>> > access_permission:extract_for_accessibility: true
>> > access_permission:fill_in_form: true
>> > access_permission:modify_annotations: true
>> > dc:format: application/pdf; version=1.3
>> > pdf:PDFVersion: 1.3
>> > pdf:encrypted: false
>> > producer: null
>> > resourceName: Desmophen+670+BAe.pdf
>> > xmpTPg:NPages: 3
>> >
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 17 December 2015 at 00:15, Erik Hatcher 
>>  wrote:
>> >
>> >> Edwin - Can you share one of those PDF files?
>> >>
>> >> Also, drop the file into the Tika app and see what it sees directly
>> -
>>  get
>> >> the tika-app JAR and run that desktop application.
>> >>
>> >> Could be an encoding issue?
>> >>
>> >> Erik
>> >>
>> >> —
>> >> Erik Hatcher, Senior Solutions Architect
>> >> 

Re: Issues when indexing PDF files

2015-12-18 Thread Zheng Lin Edwin Yeo
Hi Erick,

Thanks for your reply.

However, it is unlikely to be the browser issue, as the same result occurs
when I tried it in the Tika app.

Regards,
Edwin


On 18 December 2015 at 23:39, Erick Erickson 
wrote:

> This could also simply be your browser isn't set up to
> display UTF-8, the characters may be just fine.
>
> Best,
> Erick
>
> On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
>  wrote:
> > Thanks for all your replies.
> >
> > I did chance upon this question from stackoverflow which it says is able
> to
> > solve the issues:
> >
> http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
> >
> > However, when I tried to run it, it still get the same "?" output in
> > the content, the same as what I get from the Tika app.
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 23:58, Walter Underwood 
> > wrote:
> >
> >> PDF isn’t really text. For example, it doesn’t have spaces, it just
> moves
> >> the next letter over farther. Letters might not be in reading order —
> two
> >> column text could be printed as horizontal scans. Custom fonts might not
> >> use an encoding that matches Unicode, which makes them encrypted
> (badly).
> >> And so on.
> >>
> >> As one of my coworkers said, trying to turn a PDF into structured text
> is
> >> like trying to turn hamburger back into a cow.
> >>
> >> PDF is where text goes to die.
> >>
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >> > On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
> >> >
> >> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> >> Hi Alexandre,
> >> >>
> >> >> Thanks for your reply.
> >> >>
> >> >> So the only way to solve this issue is to explore with PDF specific
> >> tools
> >> >> and change the encoding of the file?
> >> >> Is there any way to configure it in Solr?
> >> >
> >> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> >> created in a way that Tika cannot easily extract the text, there's
> nothing
> >> you can do in Solr that will help.
> >> >
> >> > Unfortunately PDF isn't a content format but a presentation format -
> so
> >> extracting plain text is fraught with difficulty. You may see a
> character
> >> on a PDF page, but exactly how that character is generated (using a
> >> specific encoding, font, or even by drawing a picture) is outside your
> >> control. There are various businesses built on this premise - they
> charge
> >> for creating clean extracted text from PDFs - and even they have trouble
> >> with some PDFs.
> >> >
> >> > HTH
> >> >
> >> > Charlie
> >> >
> >> >>
> >> >> Regards,
> >> >> Edwin
> >> >>
> >> >>
> >> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <
> arafa...@gmail.com
> >> >
> >> >> wrote:
> >> >>
> >> >>> They could be using custom fonts and non-Unicode characters. That's
> >> >>> probably something to explore with PDF specific tools.
> >> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo"  >
> >> >>> wrote:
> >> >>>
> >>  I've checked all the files which has problem with the content in
> the
> >> Solr
> >>  index using the Tika app. All of them shows the same issues as
> what I
> >> see
> >>  in the Solr index.
> >> 
> >>  So does the issues lies with the encoding of the file? Are we able
> to
> >> >>> check
> >>  the encoding of the file?
> >> 
> >> 
> >>  Regards,
> >>  Edwin
> >> 
> >> 
> >>  On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <
> >> edwinye...@gmail.com>
> >>  wrote:
> >> 
> >> > Hi Erik,
> >> >
> >> > I've shared the file on dropbox, which you can access via the link
> >> >>> here:
> >> >
> >> >>>
> >>
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >> >
> >> > This is what I get from the Tika app after dropping the file in.
> >> >
> >> > Content-Length: 75092
> >> > Content-Type: application/pdf
> >> > Type: COSName{Info}
> >> > X-Parsed-By: org.apache.tika.parser.DefaultParser
> >> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> >> > X-TIKA:digest:SHA256:
> >> > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> >> > access_permission:assemble_document: true
> >> > access_permission:can_modify: true
> >> > access_permission:can_print: true
> >> > access_permission:can_print_degraded: true
> >> > access_permission:extract_content: true
> >> > access_permission:extract_for_accessibility: true
> >> > access_permission:fill_in_form: true
> >> > access_permission:modify_annotations: true
> >> > dc:format: application/pdf; version=1.3
> >> > pdf:PDFVersion: 1.3
> >> > pdf:encrypted: false
> >> > producer: null
> >> > resourceName: Desmophen+670+BAe.pdf
> >> > xmpTPg:NPages: 3
> >> >
> >> >
> >> > 

Re: Issues when indexing PDF files

2015-12-17 Thread Zheng Lin Edwin Yeo
Hi Alexandre,

Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?

Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch 
wrote:

> They could be using custom fonts and non-Unicode characters. That's
> probably something to explore with PDF specific tools.
> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
> wrote:
>
> > I've checked all the files which has problem with the content in the Solr
> > index using the Tika app. All of them shows the same issues as what I see
> > in the Solr index.
> >
> > So does the issues lies with the encoding of the file? Are we able to
> check
> > the encoding of the file?
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Hi Erik,
> > >
> > > I've shared the file on dropbox, which you can access via the link
> here:
> > >
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > >
> > > This is what I get from the Tika app after dropping the file in.
> > >
> > > Content-Length: 75092
> > > Content-Type: application/pdf
> > > Type: COSName{Info}
> > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > X-TIKA:digest:SHA256:
> > > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > access_permission:assemble_document: true
> > > access_permission:can_modify: true
> > > access_permission:can_print: true
> > > access_permission:can_print_degraded: true
> > > access_permission:extract_content: true
> > > access_permission:extract_for_accessibility: true
> > > access_permission:fill_in_form: true
> > > access_permission:modify_annotations: true
> > > dc:format: application/pdf; version=1.3
> > > pdf:PDFVersion: 1.3
> > > pdf:encrypted: false
> > > producer: null
> > > resourceName: Desmophen+670+BAe.pdf
> > > xmpTPg:NPages: 3
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:15, Erik Hatcher 
> > wrote:
> > >
> > >> Edwin - Can you share one of those PDF files?
> > >>
> > >> Also, drop the file into the Tika app and see what it sees directly -
> > get
> > >> the tika-app JAR and run that desktop application.
> > >>
> > >> Could be an encoding issue?
> > >>
> > >> Erik
> > >>
> > >> —
> > >> Erik Hatcher, Senior Solutions Architect
> > >> http://www.lucidworks.com 
> > >>
> > >>
> > >>
> > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I'm using Solr 5.3.0
> > >> >
> > >> > I'm indexing some PDF documents. However, for certain PDF files,
> there
> > >> are
> > >> > chinese text in the documents, but after indexing, what is indexed
> in
> > >> the
> > >> > content is either a series of "??" or an empty content.
> > >> >
> > >> > I'm using the post.jar that comes together with Solr.
> > >> >
> > >> > What could be the reason that causes this?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >>
> > >>
> > >
> >
>


Re: Issues when indexing PDF files

2015-12-17 Thread Binoy Dalal
You can always write an update handler plugin to convert your PDFs to utf-8
and then push them to solr

On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo  wrote:

> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific tools
> and change the encoding of the file?
> Is there any way to configure it in Solr?
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> wrote:
>
> > They could be using custom fonts and non-Unicode characters. That's
> > probably something to explore with PDF specific tools.
> > On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
> > wrote:
> >
> > > I've checked all the files which has problem with the content in the
> Solr
> > > index using the Tika app. All of them shows the same issues as what I
> see
> > > in the Solr index.
> > >
> > > So does the issues lies with the encoding of the file? Are we able to
> > check
> > > the encoding of the file?
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > > wrote:
> > >
> > > > Hi Erik,
> > > >
> > > > I've shared the file on dropbox, which you can access via the link
> > here:
> > > >
> > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > > >
> > > > This is what I get from the Tika app after dropping the file in.
> > > >
> > > > Content-Length: 75092
> > > > Content-Type: application/pdf
> > > > Type: COSName{Info}
> > > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > > X-TIKA:digest:SHA256:
> > > > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > > access_permission:assemble_document: true
> > > > access_permission:can_modify: true
> > > > access_permission:can_print: true
> > > > access_permission:can_print_degraded: true
> > > > access_permission:extract_content: true
> > > > access_permission:extract_for_accessibility: true
> > > > access_permission:fill_in_form: true
> > > > access_permission:modify_annotations: true
> > > > dc:format: application/pdf; version=1.3
> > > > pdf:PDFVersion: 1.3
> > > > pdf:encrypted: false
> > > > producer: null
> > > > resourceName: Desmophen+670+BAe.pdf
> > > > xmpTPg:NPages: 3
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 17 December 2015 at 00:15, Erik Hatcher 
> > > wrote:
> > > >
> > > >> Edwin - Can you share one of those PDF files?
> > > >>
> > > >> Also, drop the file into the Tika app and see what it sees directly
> -
> > > get
> > > >> the tika-app JAR and run that desktop application.
> > > >>
> > > >> Could be an encoding issue?
> > > >>
> > > >> Erik
> > > >>
> > > >> —
> > > >> Erik Hatcher, Senior Solutions Architect
> > > >> http://www.lucidworks.com 
> > > >>
> > > >>
> > > >>
> > > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
> > > edwinye...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > I'm using Solr 5.3.0
> > > >> >
> > > >> > I'm indexing some PDF documents. However, for certain PDF files,
> > there
> > > >> are
> > > >> > chinese text in the documents, but after indexing, what is indexed
> > in
> > > >> the
> > > >> > content is either a series of "??" or an empty content.
> > > >> >
> > > >> > I'm using the post.jar that comes together with Solr.
> > > >> >
> > > >> > What could be the reason that causes this?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin
> > > >>
> > > >>
> > > >
> > >
> >
>
-- 
Regards,
Binoy Dalal


RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.
Generally, I'd recommend opening an issue on PDFBox's Jira with the file that 
you shared.  Tika uses PDFBox...if a fix can be made there, it will propagate 
back through Tika to Solr.

That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode 
mapping for CID+71 (71) in font 505Eddc6Arial

So, if the file has no Unicode mapping for the font, I doubt they'll be able to 
fix it.

pdftotext is also unable to extract anything useful from the file.

Sorry.

Best,

Tim
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, December 17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific 
> tools and change the encoding of the file?
> Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
in a way that Tika cannot easily extract the text, there's nothing you can do 
in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a character on a 
PDF page, but exactly how that character is generated (using a specific 
encoding, font, or even by drawing a picture) is outside your control. There 
are various businesses built on this premise
- they charge for creating clean extracted text from PDFs - and even they have 
trouble with some PDFs.

HTH

Charlie

>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> <arafa...@gmail.com>
> wrote:
>
>> They could be using custom fonts and non-Unicode characters. That's 
>> probably something to explore with PDF specific tools.
>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>> wrote:
>>
>>> I've checked all the files which has problem with the content in the 
>>> Solr index using the Tika app. All of them shows the same issues as 
>>> what I see in the Solr index.
>>>
>>> So does the issues lies with the encoding of the file? Are we able 
>>> to
>> check
>>> the encoding of the file?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
>>> <edwinye...@gmail.com>
>>> wrote:
>>>
>>>> Hi Erik,
>>>>
>>>> I've shared the file on dropbox, which you can access via the link
>> here:
>>>>
>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?d
>> l=0
>>>>
>>>> This is what I get from the Tika app after dropping the file in.
>>>>
>>>> Content-Length: 75092
>>>> Content-Type: application/pdf
>>>> Type: COSName{Info}
>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>>>> X-TIKA:digest:SHA256:
>>>> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>>>> access_permission:assemble_document: true
>>>> access_permission:can_modify: true
>>>> access_permission:can_print: true
>>>> access_permission:can_print_degraded: true
>>>> access_permission:extract_content: true
>>>> access_permission:extract_for_accessibility: true
>>>> access_permission:fill_in_form: true
>>>> access_permission:modify_annotations: true
>>>> dc:format: application/pdf; version=1.3
>>>> pdf:PDFVersion: 1.3
>>>> pdf:encrypted: false
>>>> producer: null
>>>> resourceName: Desmophen+670+BAe.pdf
>>>> xmpTPg:NPages: 3
>>>>
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>>
>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
>>> wrote:
>>>>
>>>>> Edwin - Can you share one of those PDF files?
>>>>>
>>>>> Also, drop the file into the Tika app and see what it sees 
>>>>> directly -
>>> get
>>>>> the tika-app JAR and run that desktop application.
>>>>>
>>>>> Could be an encoding issue?
>>>>>
>>>>>  Erik
>>>>>
>>>>> —
>>>>> Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com 
>>>>> <http://www.lucidworks.com/>
>>>>>
>>>>>
>>>>>
>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
>>> edwinye...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Solr 5.3.0
>>>>>>
>>>>>> I'm indexing some PDF documents. However, for certain PDF files,
>> there
>>>>> are
>>>>>> chinese text in the documents, but after indexing, what is 
>>>>>> indexed
>> in
>>>>> the
>>>>>> content is either a series of "??" or an empty content.
>>>>>>
>>>>>> I'm using the post.jar that comes together with Solr.
>>>>>>
>>>>>> What could be the reason that causes this?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin
>>>>>
>>>>>
>>>>
>>>
>>
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood
PDF isn’t really text. For example, it doesn’t have spaces, it just moves the 
next letter over farther. Letters might not be in reading order — two column 
text could be printed as horizontal scans. Custom fonts might not use an 
encoding that matches Unicode, which makes them encrypted (badly). And so on.

As one of my coworkers said, trying to turn a PDF into structured text is like 
trying to turn hamburger back into a cow.

PDF is where text goes to die.

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
> 
> On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> Hi Alexandre,
>> 
>> Thanks for your reply.
>> 
>> So the only way to solve this issue is to explore with PDF specific tools
>> and change the encoding of the file?
>> Is there any way to configure it in Solr?
> 
> Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
> in a way that Tika cannot easily extract the text, there's nothing you can do 
> in Solr that will help.
> 
> Unfortunately PDF isn't a content format but a presentation format - so 
> extracting plain text is fraught with difficulty. You may see a character on 
> a PDF page, but exactly how that character is generated (using a specific 
> encoding, font, or even by drawing a picture) is outside your control. There 
> are various businesses built on this premise - they charge for creating clean 
> extracted text from PDFs - and even they have trouble with some PDFs.
> 
> HTH
> 
> Charlie
> 
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
>> wrote:
>> 
>>> They could be using custom fonts and non-Unicode characters. That's
>>> probably something to explore with PDF specific tools.
>>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>>> wrote:
>>> 
 I've checked all the files which has problem with the content in the Solr
 index using the Tika app. All of them shows the same issues as what I see
 in the Solr index.
 
 So does the issues lies with the encoding of the file? Are we able to
>>> check
 the encoding of the file?
 
 
 Regards,
 Edwin
 
 
 On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
 wrote:
 
> Hi Erik,
> 
> I've shared the file on dropbox, which you can access via the link
>>> here:
> 
>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> 
> This is what I get from the Tika app after dropping the file in.
> 
> Content-Length: 75092
> Content-Type: application/pdf
> Type: COSName{Info}
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> X-TIKA:digest:SHA256:
> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> access_permission:assemble_document: true
> access_permission:can_modify: true
> access_permission:can_print: true
> access_permission:can_print_degraded: true
> access_permission:extract_content: true
> access_permission:extract_for_accessibility: true
> access_permission:fill_in_form: true
> access_permission:modify_annotations: true
> dc:format: application/pdf; version=1.3
> pdf:PDFVersion: 1.3
> pdf:encrypted: false
> producer: null
> resourceName: Desmophen+670+BAe.pdf
> xmpTPg:NPages: 3
> 
> 
> Regards,
> Edwin
> 
> 
> On 17 December 2015 at 00:15, Erik Hatcher 
 wrote:
> 
>> Edwin - Can you share one of those PDF files?
>> 
>> Also, drop the file into the Tika app and see what it sees directly -
 get
>> the tika-app JAR and run that desktop application.
>> 
>> Could be an encoding issue?
>> 
>> Erik
>> 
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com 
>> 
>> 
>> 
>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
 edwinye...@gmail.com>
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm using Solr 5.3.0
>>> 
>>> I'm indexing some PDF documents. However, for certain PDF files,
>>> there
>> are
>>> chinese text in the documents, but after indexing, what is indexed
>>> in
>> the
>>> content is either a series of "??" or an empty content.
>>> 
>>> I'm using the post.jar that comes together with Solr.
>>> 
>>> What could be the reason that causes this?
>>> 
>>> Regards,
>>> Edwin
>> 
>> 
> 
 
>>> 
>> 
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk



Re: Issues when indexing PDF files

2015-12-17 Thread Charlie Hull

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:

Hi Alexandre,

Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?


Solr uses Tika to extract plain text from PDFs. If the PDFs have been 
created in a way that Tika cannot easily extract the text, there's 
nothing you can do in Solr that will help.


Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a 
character on a PDF page, but exactly how that character is generated 
(using a specific encoding, font, or even by drawing a picture) is 
outside your control. There are various businesses built on this premise 
- they charge for creating clean extracted text from PDFs - and even 
they have trouble with some PDFs.


HTH

Charlie



Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch 
wrote:


They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
wrote:


I've checked all the files which has problem with the content in the Solr
index using the Tika app. All of them shows the same issues as what I see
in the Solr index.

So does the issues lies with the encoding of the file? Are we able to

check

the encoding of the file?


Regards,
Edwin


On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
wrote:


Hi Erik,

I've shared the file on dropbox, which you can access via the link

here:



https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256:
d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher 

wrote:



Edwin - Can you share one of those PDF files?

Also, drop the file into the Tika app and see what it sees directly -

get

the tika-app JAR and run that desktop application.

Could be an encoding issue?

 Erik

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <

edwinye...@gmail.com>

wrote:


Hi,

I'm using Solr 5.3.0

I'm indexing some PDF documents. However, for certain PDF files,

there

are

chinese text in the documents, but after indexing, what is indexed

in

the

content is either a series of "??" or an empty content.

I'm using the post.jar that comes together with Solr.

What could be the reason that causes this?

Regards,
Edwin














--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo
I've checked all the files which has problem with the content in the Solr
index using the Tika app. All of them shows the same issues as what I see
in the Solr index.

So does the issues lies with the encoding of the file? Are we able to check
the encoding of the file?


Regards,
Edwin


On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
wrote:

> Hi Erik,
>
> I've shared the file on dropbox, which you can access via the link here:
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>
> This is what I get from the Tika app after dropping the file in.
>
> Content-Length: 75092
> Content-Type: application/pdf
> Type: COSName{Info}
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> X-TIKA:digest:SHA256:
> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> access_permission:assemble_document: true
> access_permission:can_modify: true
> access_permission:can_print: true
> access_permission:can_print_degraded: true
> access_permission:extract_content: true
> access_permission:extract_for_accessibility: true
> access_permission:fill_in_form: true
> access_permission:modify_annotations: true
> dc:format: application/pdf; version=1.3
> pdf:PDFVersion: 1.3
> pdf:encrypted: false
> producer: null
> resourceName: Desmophen+670+BAe.pdf
> xmpTPg:NPages: 3
>
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 00:15, Erik Hatcher  wrote:
>
>> Edwin - Can you share one of those PDF files?
>>
>> Also, drop the file into the Tika app and see what it sees directly - get
>> the tika-app JAR and run that desktop application.
>>
>> Could be an encoding issue?
>>
>> Erik
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com 
>>
>>
>>
>> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo 
>> wrote:
>> >
>> > Hi,
>> >
>> > I'm using Solr 5.3.0
>> >
>> > I'm indexing some PDF documents. However, for certain PDF files, there
>> are
>> > chinese text in the documents, but after indexing, what is indexed in
>> the
>> > content is either a series of "??" or an empty content.
>> >
>> > I'm using the post.jar that comes together with Solr.
>> >
>> > What could be the reason that causes this?
>> >
>> > Regards,
>> > Edwin
>>
>>
>


Re: Issues when indexing PDF files

2015-12-16 Thread Alexandre Rafalovitch
They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo"  wrote:

> I've checked all the files which has problem with the content in the Solr
> index using the Tika app. All of them shows the same issues as what I see
> in the Solr index.
>
> So does the issues lies with the encoding of the file? Are we able to check
> the encoding of the file?
>
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi Erik,
> >
> > I've shared the file on dropbox, which you can access via the link here:
> > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >
> > This is what I get from the Tika app after dropping the file in.
> >
> > Content-Length: 75092
> > Content-Type: application/pdf
> > Type: COSName{Info}
> > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > X-TIKA:digest:SHA256:
> > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > access_permission:assemble_document: true
> > access_permission:can_modify: true
> > access_permission:can_print: true
> > access_permission:can_print_degraded: true
> > access_permission:extract_content: true
> > access_permission:extract_for_accessibility: true
> > access_permission:fill_in_form: true
> > access_permission:modify_annotations: true
> > dc:format: application/pdf; version=1.3
> > pdf:PDFVersion: 1.3
> > pdf:encrypted: false
> > producer: null
> > resourceName: Desmophen+670+BAe.pdf
> > xmpTPg:NPages: 3
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:15, Erik Hatcher 
> wrote:
> >
> >> Edwin - Can you share one of those PDF files?
> >>
> >> Also, drop the file into the Tika app and see what it sees directly -
> get
> >> the tika-app JAR and run that desktop application.
> >>
> >> Could be an encoding issue?
> >>
> >> Erik
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com 
> >>
> >>
> >>
> >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm using Solr 5.3.0
> >> >
> >> > I'm indexing some PDF documents. However, for certain PDF files, there
> >> are
> >> > chinese text in the documents, but after indexing, what is indexed in
> >> the
> >> > content is either a series of "??" or an empty content.
> >> >
> >> > I'm using the post.jar that comes together with Solr.
> >> >
> >> > What could be the reason that causes this?
> >> >
> >> > Regards,
> >> > Edwin
> >>
> >>
> >
>


Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 5.3.0

I'm indexing some PDF documents. However, for certain PDF files, there are
chinese text in the documents, but after indexing, what is indexed in the
content is either a series of "??" or an empty content.

I'm using the post.jar that comes together with Solr.

What could be the reason that causes this?

Regards,
Edwin


Re: Issues when indexing PDF files

2015-12-16 Thread Erik Hatcher
Edwin - Can you share one of those PDF files?

Also, drop the file into the Tika app and see what it sees directly - get the 
tika-app JAR and run that desktop application.

Could be an encoding issue?  

Erik

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 



> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo  
> wrote:
> 
> Hi,
> 
> I'm using Solr 5.3.0
> 
> I'm indexing some PDF documents. However, for certain PDF files, there are
> chinese text in the documents, but after indexing, what is indexed in the
> content is either a series of "??" or an empty content.
> 
> I'm using the post.jar that comes together with Solr.
> 
> What could be the reason that causes this?
> 
> Regards,
> Edwin



Re: Issues when indexing PDF files

2015-12-16 Thread Zheng Lin Edwin Yeo
Hi Erik,

I've shared the file on dropbox, which you can access via the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0

This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256:
d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher  wrote:

> Edwin - Can you share one of those PDF files?
>
> Also, drop the file into the Tika app and see what it sees directly - get
> the tika-app JAR and run that desktop application.
>
> Could be an encoding issue?
>
> Erik
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com 
>
>
>
> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > I'm using Solr 5.3.0
> >
> > I'm indexing some PDF documents. However, for certain PDF files, there
> are
> > chinese text in the documents, but after indexing, what is indexed in the
> > content is either a series of "??" or an empty content.
> >
> > I'm using the post.jar that comes together with Solr.
> >
> > What could be the reason that causes this?
> >
> > Regards,
> > Edwin
>
>


RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you 
can -- bad things can happen if you don't [1] [2].

Erick's blog on SolrJ is fantastic.  If you want to have Tika parse embedded 
documents/attachments, make sure to set the parser in the ParseContext before 
parsing:

ParseContext context = new ParseContext();
//add this line:
context.set(Parser.class, _autoParser)
 InputStream input = new FileInputStream(file);

Tika 1.8 is soon to be released.  If that doesn't fix your problems, please 
submit stacktraces (and docs, if possible) to the Tika jira, and we'll try to 
make the fixes.  

Cheers,

Tim

[1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf 
[2] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 
-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to Solr, in
 a
 production environment the Solr server is responsible for indexing,
 parsing the
 docs through Tika, perhaps searching etc. This doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally independent
 of
 what version of Tika is on the Solr server. Here's an example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 vijaya.bhoomire...@whishworks.com wrote:
  Thanks everyone for the responses. Now I am able to index PDF documents
  successfully. I have implemented manual extraction using Tika's
 AutoParser
  and PDF functionality is working fine. However,  the error with some MS
  office word documents still persist.
 
  The error message is java.lang.IllegalArgumentException: This paragraph
 is
  not the first one in the table which will eventually result in
 Unexpected
  RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
 
  Upon some reading, it looks like its a bug with Tika 1.5 and seems to
 have
  been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
  I am new to Solr / Tika and hence wondering whether I can change the Tika
  library alone to v1.6 without impacting any of the libraries within Solr
  4.10.2? Please let me know your response and how to get away with this
  issue.
 
  Many thanks in advance.
 
  Thanks  Regards
  Vijay
 
 
  On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
 
  Vijay,
 
  You could try different excel files with different formats to rule out
 the
  issue is with TIKA version being used.
 
  Thanks
  Murthy
 
  On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
  wrote:
 
   Perhaps the PDF is protected and the content can not be extracted?
  
   i have an unverified suspicion that the tika shipped with solr 4.10.2
 may
   not support some/all office 2013 document formats.
  
  
  
  
  
   On 4/14/2015 8:18 PM, Jack Krupansky wrote:
  
   Try doing a manual extraction request directly to Solr (not via
 SolrJ)
  and
   use the extractOnly option to see if the content is actually
 extracted.
  
   See:
   https://cwiki.apache.org/confluence/display/solr/
   Uploading+Data+with+Solr+Cell+using+Apache+Tika
  
   Also, some PDF files actually have the content as a bitmap image, so
 no
   text is extracted.
  
  
   -- Jack Krupansky
  
   On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
 
   vijaya.bhoomire...@whishworks.com wrote:
  
Hi,
  
   I am trying to index PDF and Microsoft Office files (.doc, .docx,
 .ppt,
   .pptx, .xlx, and .xlx) files into Solr. I am facing the following
  issues.
   Request to please let me know what is going wrong with the indexing
   process.
  
   I am using solr 4.10.2 and using the default example server
  configuration
   that comes with Solr distribution.
  
   PDF Files - Indexing as such works fine, but when I query using *.*
 in
   the
   Solr Query console, metadata information is displayed properly.
  However,
   the PDF content field is empty. This is happening for all PDF files

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
This sounds like a Tika issue, let's move discussion to that list.

If you are still having problems after you upgrade to Tika 1.8, please at least 
submit the stack traces (if you can) to the Tika jira.  We may be able to find 
a document that triggers that stack trace in govdocs1 or the slice of 
CommonCrawl that Julien Nioche contributed to our eval effort.

Tika is not perfect and it will fail on some files, but we are always working 
to improve it.

Best,

  Tim

-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks  Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

 I entirely agree with Erick -- it is best to isolate Tika in its own jvm
 if you can -- bad things can happen if you don't [1] [2].

 Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
 embedded documents/attachments, make sure to set the parser in the
 ParseContext before parsing:

 ParseContext context = new ParseContext();
 //add this line:
 context.set(Parser.class, _autoParser)
  InputStream input = new FileInputStream(file);

 Tika 1.8 is soon to be released.  If that doesn't fix your problems,
 please submit stacktraces (and docs, if possible) to the Tika jira, and
 we'll try to make the fixes.

 Cheers,

 Tim

 [1]
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
 [2]
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
 are getting parsed properly and indexed into Solr. However, a minority of
 them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents can be
 indexed. The business requirement we have is to index all the documents.
 However, if a small percentage of them fails, not sure what other ways
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

  There's quite a discussion here:
  https://issues.apache.org/jira/browse/SOLR-7137
 
  But, I personally am not a huge fan of pushing all the work on to Solr,
 in
  a
  production environment the Solr server is responsible for indexing,
  parsing the
  docs through Tika, perhaps searching etc. This doesn't scale all that
 well.
 
  So an alternative is to use SolrJ with Tika, which is totally independent
  of
  what version of Tika is on the Solr server. Here's an example.
 
  http://lucidworks.com/blog/indexing-with-solrj/
 
  Best,
  Erick
 
  On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
  vijaya.bhoomire...@whishworks.com wrote:
   Thanks everyone for the responses. Now I am able to index PDF documents
   successfully. I have implemented manual extraction using Tika's
  AutoParser
   and PDF functionality is working fine. However,  the error with some MS
   office word documents still persist.
  
   The error message is java.lang.IllegalArgumentException: This
 paragraph
  is
   not the first one in the table which will eventually result in
  Unexpected
   RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
  
   Upon some reading, it looks like its a bug with Tika 1.5 and seems to
  have
   been fixed with Tika 1.6 (
  https://issues.apache.org/jira/browse/TIKA-1251 ).
   I am new to Solr / Tika and hence wondering whether I can change the
 Tika
   library alone to v1.6 without impacting any of the libraries within
 Solr
   4.10.2? Please let me know your response and how to get away with this
   issue.
  
   Many thanks in advance.
  
   Thanks  Regards
   Vijay
  
  
   On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
  
   Vijay,
  
   You could try different excel files with different formats to rule out
  the
   issue is with TIKA version being used.
  
   Thanks
   Murthy
  
   On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
   wrote

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
+1 

:)

PS: one more thing - please, tell your management that you will never 
ever successfully all real-world PDFs and cater for that fact in your 
requirements :-)



Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to Solr, in
 a
 production environment the Solr server is responsible for indexing,
 parsing the
 docs through Tika, perhaps searching etc. This doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally independent
 of
 what version of Tika is on the Solr server. Here's an example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 vijaya.bhoomire...@whishworks.com wrote:
  Thanks everyone for the responses. Now I am able to index PDF documents
  successfully. I have implemented manual extraction using Tika's
 AutoParser
  and PDF functionality is working fine. However,  the error with some MS
  office word documents still persist.
 
  The error message is java.lang.IllegalArgumentException: This paragraph
 is
  not the first one in the table which will eventually result in
 Unexpected
  RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
 
  Upon some reading, it looks like its a bug with Tika 1.5 and seems to
 have
  been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
  I am new to Solr / Tika and hence wondering whether I can change the Tika
  library alone to v1.6 without impacting any of the libraries within Solr
  4.10.2? Please let me know your response and how to get away with this
  issue.
 
  Many thanks in advance.
 
  Thanks  Regards
  Vijay
 
 
  On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
 
  Vijay,
 
  You could try different excel files with different formats to rule out
 the
  issue is with TIKA version being used.
 
  Thanks
  Murthy
 
  On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
  wrote:
 
   Perhaps the PDF is protected and the content can not be extracted?
  
   i have an unverified suspicion that the tika shipped with solr 4.10.2
 may
   not support some/all office 2013 document formats.
  
  
  
  
  
   On 4/14/2015 8:18 PM, Jack Krupansky wrote:
  
   Try doing a manual extraction request directly to Solr (not via
 SolrJ)
  and
   use the extractOnly option to see if the content is actually
 extracted.
  
   See:
   https://cwiki.apache.org/confluence/display/solr/
   Uploading+Data+with+Solr+Cell+using+Apache+Tika
  
   Also, some PDF files actually have the content as a bitmap image, so
 no
   text is extracted.
  
  
   -- Jack Krupansky
  
   On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
 
   vijaya.bhoomire...@whishworks.com wrote:
  
Hi,
  
   I am trying to index PDF and Microsoft Office files (.doc, .docx,
 .ppt,
   .pptx, .xlx, and .xlx) files into Solr. I am facing the following
  issues.
   Request to please let me know what is going wrong with the indexing
   process.
  
   I am using solr 4.10.2 and using the default example server
  configuration
   that comes with Solr distribution.
  
   PDF Files - Indexing as such works fine, but when I query using *.*
 in
   the
   Solr Query console, metadata information is displayed properly.
  However,
   the PDF content field is empty. This is happening for all PDF files
 I
   have
   tried. I have tried with some proprietary files, PDF eBooks etc.
  Whatever
   be the PDF file, content is not being displayed.
  
   MS Office files -  For some office files, everything works perfect
 and
   the
   extracted content is visible in the query console. However, for
  others, I
   see the below error message during the indexing process.
  
   *Exception in thread main
  
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
   org.apache.tika.exception.TikaException: Unexpected RuntimeException
   from
   org.apache.tika.parser.microsoft.OfficeParser*
  
  
   I am using SolrJ to index the documents and below is the code
 snippet
   related to indexing. Please let me know where the issue is
 occurring.
  
static String solrServerURL = 
   http://localhost:8983/solr;;
   static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest
 indexingReq
  =
   new
  
ContentStreamUpdateRequest(/update/extract);
  
   

Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)


Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never 
ever successfully all real-world PDFs and cater for that fact in your 
requirements :-)


On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to Solr, in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that well.

So an alternative is to use SolrJ with Tika, which is totally independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's

AutoParser

and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This paragraph

is

not the first one in the table which will eventually result in

Unexpected

RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to

have

been fixed with Tika 1.6 (

https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the Tika
library alone to v1.6 without impacting any of the libraries within Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:


Vijay,

You could try different excel files with different formats to rule out

the

issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
wrote:


Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2

may

not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via

SolrJ)

and

use the extractOnly option to see if the content is actually

extracted.


See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so

no

text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy



vijaya.bhoomire...@whishworks.com wrote:

  Hi,


I am trying to index PDF and Microsoft Office files (.doc, .docx,

.ppt,

.pptx, .xlx, and .xlx) files into Solr. I am facing the following

issues.

Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server

configuration

that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.*

in

the
Solr Query console, metadata information is displayed properly.

However,

the PDF content field is empty. This is happening for all PDF files

I

have
tried. I have tried with some proprietary files, PDF eBooks etc.

Whatever

be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect

and

the
extracted content is visible in the query console. However, for

others, I

see the below error message during the indexing process.

*Exception in thread main


org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:

org.apache.tika.exception.TikaException: Unexpected RuntimeException
from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code

snippet

related to 

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks  Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

 I entirely agree with Erick -- it is best to isolate Tika in its own jvm
 if you can -- bad things can happen if you don't [1] [2].

 Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
 embedded documents/attachments, make sure to set the parser in the
 ParseContext before parsing:

 ParseContext context = new ParseContext();
 //add this line:
 context.set(Parser.class, _autoParser)
  InputStream input = new FileInputStream(file);

 Tika 1.8 is soon to be released.  If that doesn't fix your problems,
 please submit stacktraces (and docs, if possible) to the Tika jira, and
 we'll try to make the fixes.

 Cheers,

 Tim

 [1]
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
 [2]
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
 are getting parsed properly and indexed into Solr. However, a minority of
 them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents can be
 indexed. The business requirement we have is to index all the documents.
 However, if a small percentage of them fails, not sure what other ways
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

  There's quite a discussion here:
  https://issues.apache.org/jira/browse/SOLR-7137
 
  But, I personally am not a huge fan of pushing all the work on to Solr,
 in
  a
  production environment the Solr server is responsible for indexing,
  parsing the
  docs through Tika, perhaps searching etc. This doesn't scale all that
 well.
 
  So an alternative is to use SolrJ with Tika, which is totally independent
  of
  what version of Tika is on the Solr server. Here's an example.
 
  http://lucidworks.com/blog/indexing-with-solrj/
 
  Best,
  Erick
 
  On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
  vijaya.bhoomire...@whishworks.com wrote:
   Thanks everyone for the responses. Now I am able to index PDF documents
   successfully. I have implemented manual extraction using Tika's
  AutoParser
   and PDF functionality is working fine. However,  the error with some MS
   office word documents still persist.
  
   The error message is java.lang.IllegalArgumentException: This
 paragraph
  is
   not the first one in the table which will eventually result in
  Unexpected
   RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
  
   Upon some reading, it looks like its a bug with Tika 1.5 and seems to
  have
   been fixed with Tika 1.6 (
  https://issues.apache.org/jira/browse/TIKA-1251 ).
   I am new to Solr / Tika and hence wondering whether I can change the
 Tika
   library alone to v1.6 without impacting any of the libraries within
 Solr
   4.10.2? Please let me know your response and how to get away with this
   issue.
  
   Many thanks in advance.
  
   Thanks  Regards
   Vijay
  
  
   On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
  
   Vijay,
  
   You could try different excel files with different formats to rule out
  the
   issue is with TIKA version being used.
  
   Thanks
   Murthy
  
   On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
   wrote:
  
Perhaps the PDF is protected and the content can not be extracted?
   
i have an unverified suspicion that the tika shipped with solr
 4.10.2
  may
not support some/all office 2013 document formats.
   
   
   
   
   
On 4/14/2015 8:18 PM, Jack Krupansky wrote:
   
Try doing a manual extraction request directly to Solr (not via
  SolrJ)
   and
use the extractOnly option to see if the content is actually
  extracted.
   
See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika
   
Also, some PDF files actually have the content as a bitmap image,
 so
  no
text is extracted.
   
   
-- Jack Krupansky
   
On Tue, Apr

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Tim.

I shall raise a Jira with the stack trace information.

Thanks  Regards
Vijay


On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

 This sounds like a Tika issue, let's move discussion to that list.

 If you are still having problems after you upgrade to Tika 1.8, please at
 least submit the stack traces (if you can) to the Tika jira.  We may be
 able to find a document that triggers that stack trace in govdocs1 or the
 slice of CommonCrawl that Julien Nioche contributed to our eval effort.

 Tika is not perfect and it will fail on some files, but we are always
 working to improve it.

 Best,

   Tim

 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Thanks Allison.

 I tried with the mentioned changes. But still no luck. I am using the code
 from lucidworks site provided by Erick and now included the changes
 mentioned by you. But still the issue persists with a small percentage of
 documents (both PDF and MS Office documents) failing. Unfortunately, these
 documents are proprietary and client-confidential and hence I am not sure
 whether they can be uploaded into Jira.

 These files normally open in Adobe Reader and MS Office tools.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

  I entirely agree with Erick -- it is best to isolate Tika in its own jvm
  if you can -- bad things can happen if you don't [1] [2].
 
  Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
  embedded documents/attachments, make sure to set the parser in the
  ParseContext before parsing:
 
  ParseContext context = new ParseContext();
  //add this line:
  context.set(Parser.class, _autoParser)
   InputStream input = new FileInputStream(file);
 
  Tika 1.8 is soon to be released.  If that doesn't fix your problems,
  please submit stacktraces (and docs, if possible) to the Tika jira, and
  we'll try to make the fixes.
 
  Cheers,
 
  Tim
 
  [1]
 
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
  [2]
 
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
  -Original Message-
  From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
  vijaya.bhoomire...@whishworks.com]
  Sent: Thursday, April 16, 2015 7:10 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing PDF and MS Office files
 
  Erick,
 
  I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
  SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
  are getting parsed properly and indexed into Solr. However, a minority of
  them keep failing wither PDFParser or OfficeParser error.
 
  Not sure if this behaviour can be modified so that all the documents can
 be
  indexed. The business requirement we have is to index all the documents.
  However, if a small percentage of them fails, not sure what other ways
  exist to index them.
 
  Any help please?
 
 
  Thanks  Regards
  Vijay
 
 
 
  On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
 wrote:
 
   There's quite a discussion here:
   https://issues.apache.org/jira/browse/SOLR-7137
  
   But, I personally am not a huge fan of pushing all the work on to Solr,
  in
   a
   production environment the Solr server is responsible for indexing,
   parsing the
   docs through Tika, perhaps searching etc. This doesn't scale all that
  well.
  
   So an alternative is to use SolrJ with Tika, which is totally
 independent
   of
   what version of Tika is on the Solr server. Here's an example.
  
   http://lucidworks.com/blog/indexing-with-solrj/
  
   Best,
   Erick
  
   On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
   vijaya.bhoomire...@whishworks.com wrote:
Thanks everyone for the responses. Now I am able to index PDF
 documents
successfully. I have implemented manual extraction using Tika's
   AutoParser
and PDF functionality is working fine. However,  the error with some
 MS
office word documents still persist.
   
The error message is java.lang.IllegalArgumentException: This
  paragraph
   is
not the first one in the table which will eventually result in
   Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
   
Upon some reading, it looks like its a bug with Tika 1.5 and seems to
   have
been fixed with Tika 1.6 (
   https://issues.apache.org/jira/browse/TIKA-1251 ).
I am new to Solr / Tika and hence wondering whether I can change the
  Tika
library alone to v1.6 without impacting any of the libraries within
  Solr
4.10.2? Please let me know your response and how to get away with
 this
issue.
   
Many thanks in advance.
   
Thanks  Regards
Vijay
   
   
On 15 April 2015

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
If you use pdftotext with a simple fork/exec per document, you will get about 5 
MB/s throughput on a single AMD x86_64.   Much of that is because of the 
fork/exec.   I suggest that you use HTML output and UTF-8 encoding  for the 
PDF, because that way you can get title/keywords and such as http meta keywords.

If you have the appetite for something truly great, try:
 - Socket server listening for parsing requests
 - pass off accept() sockets to pre-forked children
 - in the children, use vfork, rather than fork
 -  tmpfs for outputted HTML documents
 - Tempting to implement using mod_perl and httpd, at least to me.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never ever 
successfully all real-world PDFs and cater for that fact in your requirements 
:-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:
 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as 
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word 
 documents are getting parsed properly and indexed into Solr. However, 
 a minority of them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents 
 can be indexed. The business requirement we have is to index all the 
 documents.
 However, if a small percentage of them fails, not sure what other ways 
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to 
 Solr, in a production environment the Solr server is responsible for 
 indexing, parsing the docs through Tika, perhaps searching etc. This 
 doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally 
 independent of what version of Tika is on the Solr server. Here's an 
 example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy 
 vijaya.bhoomire...@whishworks.com wrote:
 Thanks everyone for the responses. Now I am able to index PDF 
 documents successfully. I have implemented manual extraction using 
 Tika's
 AutoParser
 and PDF functionality is working fine. However,  the error with some 
 MS office word documents still persist.

 The error message is java.lang.IllegalArgumentException: This 
 paragraph
 is
 not the first one in the table which will eventually result in
 Unexpected
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

 Upon some reading, it looks like its a bug with Tika 1.5 and seems 
 to
 have
 been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
 I am new to Solr / Tika and hence wondering whether I can change the 
 Tika library alone to v1.6 without impacting any of the libraries 
 within Solr 4.10.2? Please let me know your response and how to get 
 away with this issue.

 Many thanks in advance.

 Thanks  Regards
 Vijay


 On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:

 Vijay,

 You could try different excel files with different formats to rule 
 out
 the
 issue is with TIKA version being used.

 Thanks
 Murthy

 On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
 trhodes...@gmail.com
 wrote:

 Perhaps the PDF is protected and the content can not be extracted?

 i have an unverified suspicion that the tika shipped with solr 
 4.10.2
 may
 not support some/all office 2013 document formats.





 On 4/14/2015 8:18 PM, Jack Krupansky wrote:

 Try doing a manual extraction request directly to Solr (not via
 SolrJ)
 and
 use the extractOnly option to see if the content is actually
 extracted.

 See:
 https://cwiki.apache.org/confluence/display/solr/
 Uploading+Data+with+Solr+Cell+using+Apache+Tika

 Also, some PDF files actually have the content as a bitmap image, 
 so
 no
 text is extracted.


 -- Jack Krupansky

 On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi 
 Reddy
 
 vijaya.bhoomire...@whishworks.com wrote:

   Hi,

 I am trying to index PDF and Microsoft Office files (.doc, 
 .docx,
 .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the 
 following
 issues.
 Request to please let me know what is going wrong with the 
 indexing process.

 I am using solr 4.10.2 and using the default example server
 configuration

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
Indeed.   Another solution is to purchase ABBYY or Nuance as a server, and have 
them do that work.
You will even get OCR.Both offer a Linux SDK.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, April 16, 2015 7:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing PDF and MS Office files

+1

:)

PS: one more thing - please, tell your management that you will never 
ever successfully all real-world PDFs and cater for that fact in your 
requirements :-)



Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
For MS Word documents, one common pattern for all failed documents I
noticed is that all of them contain embedded images (like scanned signature
images embedded into the documents. These documents are much like some
letterheads where someone scanned the signature image and then embedded
into the document along with the text) with in the documents.

For other documents which completed successfully, no images were present.
Just wondering if these are causing the issue.


Thanks  Regards
Vijay



On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:

 Thanks Tim.

 I shall raise a Jira with the stack trace information.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

 This sounds like a Tika issue, let's move discussion to that list.

 If you are still having problems after you upgrade to Tika 1.8, please at
 least submit the stack traces (if you can) to the Tika jira.  We may be
 able to find a document that triggers that stack trace in govdocs1 or the
 slice of CommonCrawl that Julien Nioche contributed to our eval effort.

 Tika is not perfect and it will fail on some files, but we are always
 working to improve it.

 Best,

   Tim

 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Thanks Allison.

 I tried with the mentioned changes. But still no luck. I am using the code
 from lucidworks site provided by Erick and now included the changes
 mentioned by you. But still the issue persists with a small percentage of
 documents (both PDF and MS Office documents) failing. Unfortunately, these
 documents are proprietary and client-confidential and hence I am not sure
 whether they can be uploaded into Jira.

 These files normally open in Adobe Reader and MS Office tools.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org
 wrote:

  I entirely agree with Erick -- it is best to isolate Tika in its own jvm
  if you can -- bad things can happen if you don't [1] [2].
 
  Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
  embedded documents/attachments, make sure to set the parser in the
  ParseContext before parsing:
 
  ParseContext context = new ParseContext();
  //add this line:
  context.set(Parser.class, _autoParser)
   InputStream input = new FileInputStream(file);
 
  Tika 1.8 is soon to be released.  If that doesn't fix your problems,
  please submit stacktraces (and docs, if possible) to the Tika jira, and
  we'll try to make the fixes.
 
  Cheers,
 
  Tim
 
  [1]
 
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
  [2]
 
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
  -Original Message-
  From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
  vijaya.bhoomire...@whishworks.com]
  Sent: Thursday, April 16, 2015 7:10 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing PDF and MS Office files
 
  Erick,
 
  I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
  SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
  are getting parsed properly and indexed into Solr. However, a minority
 of
  them keep failing wither PDFParser or OfficeParser error.
 
  Not sure if this behaviour can be modified so that all the documents
 can be
  indexed. The business requirement we have is to index all the documents.
  However, if a small percentage of them fails, not sure what other ways
  exist to index them.
 
  Any help please?
 
 
  Thanks  Regards
  Vijay
 
 
 
  On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
 wrote:
 
   There's quite a discussion here:
   https://issues.apache.org/jira/browse/SOLR-7137
  
   But, I personally am not a huge fan of pushing all the work on to
 Solr,
  in
   a
   production environment the Solr server is responsible for indexing,
   parsing the
   docs through Tika, perhaps searching etc. This doesn't scale all that
  well.
  
   So an alternative is to use SolrJ with Tika, which is totally
 independent
   of
   what version of Tika is on the Solr server. Here's an example.
  
   http://lucidworks.com/blog/indexing-with-solrj/
  
   Best,
   Erick
  
   On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
   vijaya.bhoomire...@whishworks.com wrote:
Thanks everyone for the responses. Now I am able to index PDF
 documents
successfully. I have implemented manual extraction using Tika's
   AutoParser
and PDF functionality is working fine. However,  the error with
 some MS
office word documents still persist.
   
The error message is java.lang.IllegalArgumentException: This
  paragraph
   is
not the first one in the table which

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull

On 16/04/2015 12:53, Siegfried Goeschl wrote:

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)


Here's some file extractors we built a while ago:
https://github.com/flaxsearch/flaxcode/tree/master/flax_filters
You might find them useful: they use a number of external programs 
including pdf2text and headless Open Office.


Cheers

Charlie


If you start command line tools from your JVM please have a look at
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never
ever successfully all real-world PDFs and cater for that fact in your
requirements :-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents
can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to
Solr, in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that
well.

So an alternative is to use SolrJ with Tika, which is totally
independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's

AutoParser

and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This
paragraph

is

not the first one in the table which will eventually result in

Unexpected

RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to

have

been fixed with Tika 1.6 (

https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the
Tika
library alone to v1.6 without impacting any of the libraries within
Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:


Vijay,

You could try different excel files with different formats to rule out

the

issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
wrote:


Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2

may

not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via

SolrJ)

and

use the extractOnly option to see if the content is actually

extracted.


See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so

no

text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy



vijaya.bhoomire...@whishworks.com wrote:

  Hi,


I am trying to index PDF and Microsoft Office files (.doc, .docx,

.ppt,

.pptx, .xlx, and .xlx) files into Solr. I am facing the following

issues.

Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server

configuration

that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.*

in

the
Solr Query console, metadata information is displayed properly.

However,

the PDF content field is empty. This is happening for all PDF files

I

have
tried. I have tried with some proprietary files, PDF eBooks etc.

Whatever

be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect

and

the
extracted content is visible in the query console. However, for

others, I

see the below error message during the indexing process.

*Exception in thread 

Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood
Turning PDF back into a structured document is like trying to turn hamburger 
back into a cow.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. talli...@mitre.org wrote:

 +1 
 
 :)
 
 PS: one more thing - please, tell your management that you will never 
 ever successfully all real-world PDFs and cater for that fact in your 
 requirements :-)
 



Re: Indexing PDF and MS Office files

2015-04-15 Thread Erick Erickson
There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to Solr, in a
production environment the Solr server is responsible for indexing, parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that well.

So an alternative is to use SolrJ with Tika, which is totally independent of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
 Thanks everyone for the responses. Now I am able to index PDF documents
 successfully. I have implemented manual extraction using Tika's AutoParser
 and PDF functionality is working fine. However,  the error with some MS
 office word documents still persist.

 The error message is java.lang.IllegalArgumentException: This paragraph is
 not the first one in the table which will eventually result in Unexpected
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

 Upon some reading, it looks like its a bug with Tika 1.5 and seems to have
 been fixed with Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).
 I am new to Solr / Tika and hence wondering whether I can change the Tika
 library alone to v1.6 without impacting any of the libraries within Solr
 4.10.2? Please let me know your response and how to get away with this
 issue.

 Many thanks in advance.

 Thanks  Regards
 Vijay


 On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:

 Vijay,

 You could try different excel files with different formats to rule out the
 issue is with TIKA version being used.

 Thanks
 Murthy

 On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
 wrote:

  Perhaps the PDF is protected and the content can not be extracted?
 
  i have an unverified suspicion that the tika shipped with solr 4.10.2 may
  not support some/all office 2013 document formats.
 
 
 
 
 
  On 4/14/2015 8:18 PM, Jack Krupansky wrote:
 
  Try doing a manual extraction request directly to Solr (not via SolrJ)
 and
  use the extractOnly option to see if the content is actually extracted.
 
  See:
  https://cwiki.apache.org/confluence/display/solr/
  Uploading+Data+with+Solr+Cell+using+Apache+Tika
 
  Also, some PDF files actually have the content as a bitmap image, so no
  text is extracted.
 
 
  -- Jack Krupansky
 
  On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
  vijaya.bhoomire...@whishworks.com wrote:
 
   Hi,
 
  I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
  .pptx, .xlx, and .xlx) files into Solr. I am facing the following
 issues.
  Request to please let me know what is going wrong with the indexing
  process.
 
  I am using solr 4.10.2 and using the default example server
 configuration
  that comes with Solr distribution.
 
  PDF Files - Indexing as such works fine, but when I query using *.* in
  the
  Solr Query console, metadata information is displayed properly.
 However,
  the PDF content field is empty. This is happening for all PDF files I
  have
  tried. I have tried with some proprietary files, PDF eBooks etc.
 Whatever
  be the PDF file, content is not being displayed.
 
  MS Office files -  For some office files, everything works perfect and
  the
  extracted content is visible in the query console. However, for
 others, I
  see the below error message during the indexing process.
 
  *Exception in thread main
  org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
  org.apache.tika.exception.TikaException: Unexpected RuntimeException
  from
  org.apache.tika.parser.microsoft.OfficeParser*
 
 
  I am using SolrJ to index the documents and below is the code snippet
  related to indexing. Please let me know where the issue is occurring.
 
   static String solrServerURL = 
  http://localhost:8983/solr;;
  static SolrServer solrServer = new HttpSolrServer(solrServerURL);
   static ContentStreamUpdateRequest indexingReq
 =
  new
 
   ContentStreamUpdateRequest(/update/extract);
 
   indexingReq.addFile(file, fileType);
  indexingReq.setParam(literal.id, literalId);
  indexingReq.setParam(uprefix, attr_);
  indexingReq.setParam(fmap.content, content);
  indexingReq.setParam(literal.fileurl, fileURL);
  indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
  solrServer.request(indexingReq);
 
  Thanks  Regards
  Vijay
 
  --
  The contents of this e-mail are confidential and for the exclusive use
 of
  the intended recipient. If you receive this e-mail in error please
 delete
  it from your system immediately and notify us either by e-mail or
  telephone. You should not copy, forward or otherwise disclose the
 content
  of the e-mail. The views expressed in this communication may not
  necessarily be the view held 

Re: Indexing PDF and MS Office files

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's AutoParser
and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This paragraph is
not the first one in the table which will eventually result in Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to have
been fixed with Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).
I am new to Solr / Tika and hence wondering whether I can change the Tika
library alone to v1.6 without impacting any of the libraries within Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:

 Vijay,

 You could try different excel files with different formats to rule out the
 issue is with TIKA version being used.

 Thanks
 Murthy

 On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
 wrote:

  Perhaps the PDF is protected and the content can not be extracted?
 
  i have an unverified suspicion that the tika shipped with solr 4.10.2 may
  not support some/all office 2013 document formats.
 
 
 
 
 
  On 4/14/2015 8:18 PM, Jack Krupansky wrote:
 
  Try doing a manual extraction request directly to Solr (not via SolrJ)
 and
  use the extractOnly option to see if the content is actually extracted.
 
  See:
  https://cwiki.apache.org/confluence/display/solr/
  Uploading+Data+with+Solr+Cell+using+Apache+Tika
 
  Also, some PDF files actually have the content as a bitmap image, so no
  text is extracted.
 
 
  -- Jack Krupansky
 
  On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
  vijaya.bhoomire...@whishworks.com wrote:
 
   Hi,
 
  I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
  .pptx, .xlx, and .xlx) files into Solr. I am facing the following
 issues.
  Request to please let me know what is going wrong with the indexing
  process.
 
  I am using solr 4.10.2 and using the default example server
 configuration
  that comes with Solr distribution.
 
  PDF Files - Indexing as such works fine, but when I query using *.* in
  the
  Solr Query console, metadata information is displayed properly.
 However,
  the PDF content field is empty. This is happening for all PDF files I
  have
  tried. I have tried with some proprietary files, PDF eBooks etc.
 Whatever
  be the PDF file, content is not being displayed.
 
  MS Office files -  For some office files, everything works perfect and
  the
  extracted content is visible in the query console. However, for
 others, I
  see the below error message during the indexing process.
 
  *Exception in thread main
  org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
  org.apache.tika.exception.TikaException: Unexpected RuntimeException
  from
  org.apache.tika.parser.microsoft.OfficeParser*
 
 
  I am using SolrJ to index the documents and below is the code snippet
  related to indexing. Please let me know where the issue is occurring.
 
   static String solrServerURL = 
  http://localhost:8983/solr;;
  static SolrServer solrServer = new HttpSolrServer(solrServerURL);
   static ContentStreamUpdateRequest indexingReq
 =
  new
 
   ContentStreamUpdateRequest(/update/extract);
 
   indexingReq.addFile(file, fileType);
  indexingReq.setParam(literal.id, literalId);
  indexingReq.setParam(uprefix, attr_);
  indexingReq.setParam(fmap.content, content);
  indexingReq.setParam(literal.fileurl, fileURL);
  indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
  solrServer.request(indexingReq);
 
  Thanks  Regards
  Vijay
 
  --
  The contents of this e-mail are confidential and for the exclusive use
 of
  the intended recipient. If you receive this e-mail in error please
 delete
  it from your system immediately and notify us either by e-mail or
  telephone. You should not copy, forward or otherwise disclose the
 content
  of the e-mail. The views expressed in this communication may not
  necessarily be the view held by WHISHWORKS.
 
 
 


 --
 Ph: 9845704792


-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.


Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

Here are the solr-config xml and the error log from Solr logs for your
reference. As mentioned earlier, I didnt make any changes to the
solr-config.xml as I am using the xml file out of the box one that came
with the default installation.

Please let me know your thoughts on why these issues are occuring.

Thanks  Regards
Vijay


*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW
*T:  +44 20 3475 7980*
*M: **+44 7481 298 360*
*W: *ww http://www.whishworks.com/w.whishworks.com
http://www.whishworks.com/

https://www.linkedin.com/company/whishworks
http://www.whishworks.com/blog/  https://twitter.com/WHISHWORKS
https://www.facebook.com/whishworksit

On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:

 Hi,

 I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
 Request to please let me know what is going wrong with the indexing
 process.

 I am using solr 4.10.2 and using the default example server configuration
 that comes with Solr distribution.

 PDF Files - Indexing as such works fine, but when I query using *.* in the
 Solr Query console, metadata information is displayed properly. However,
 the PDF content field is empty. This is happening for all PDF files I have
 tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
 be the PDF file, content is not being displayed.

 MS Office files -  For some office files, everything works perfect and the
 extracted content is visible in the query console. However, for others, I
 see the below error message during the indexing process.

 *Exception in thread main
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from
 org.apache.tika.parser.microsoft.OfficeParser*


 I am using SolrJ to index the documents and below is the code snippet
 related to indexing. Please let me know where the issue is occurring.

 static String solrServerURL = 
 http://localhost:8983/solr;;
 static SolrServer solrServer = new HttpSolrServer(solrServerURL);
 static ContentStreamUpdateRequest indexingReq =
 new
 ContentStreamUpdateRequest(/update/extract);

 indexingReq.addFile(file, fileType);
 indexingReq.setParam(literal.id, literalId);
 indexingReq.setParam(uprefix, attr_);
 indexingReq.setParam(fmap.content, content);
 indexingReq.setParam(literal.fileurl, fileURL);
 indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
 solrServer.request(indexingReq);

 Thanks  Regards
 Vijay




-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.
?xml version=1.0 encoding=UTF-8 ?
!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the License); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an AS IS BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
--

!-- 
 For more details about configurations options that may appear in
 this file, see http://wiki.apache.org/solr/SolrConfigXml. 
--
config
  !-- In all configuration below, a prefix of solr. for class names
   is an alias that causes solr to search appropriate packages,
   including org.apache.solr.(search|update|request|core|analysis)

   You may also specify a fully qualified Java classname if you
   have your own custom plugins.
--

  !-- Controls what version of Lucene various components of Solr
   adhere to.  Generally, you want to use the latest version to
   get all bug fixes and improvements. It is highly recommended
   that you fully re-index after changing this setting as it can
   affect both how text is indexed and queried.
  --
  luceneMatchVersion4.10.2/luceneMatchVersion

  !-- lib/ directives can be used to instruct Solr to load any Jars
   identified and use them to resolve any plugins specified in
   your solrconfig.xml or schema.xml (ie: Analyzers, Request
 

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Andrea,

Yes, I am using the stock schema.xml that comes with the example server of
Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and
put into the content field in the index.

Please find the log information for the Parsing error below.


org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

It seems something like https://issues.apache.org/jira/browse/TIKA-1251.
I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seems 
to be fixed in Tika 1.6.


I agree with Erik: you should try with another version of Tika.

Best,
Andrea

On 04/14/2015 06:44 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Andrea,

Yes, I am using the stock schema.xml that comes with the example server of
Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and
put into the content field in the index.

Please find the log information for the Parsing error below.


org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at

Re: Indexing PDF and MS Office files

2015-04-14 Thread Erick Erickson
looks like this is just a file that Tika can't handle, based on this line:

bq: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

You might be able to get some joy from parsing this from Java and see
if a more recent Tika would fix it. Here's some  sample code:

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Tue, Apr 14, 2015 at 9:44 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
 Andrea,

 Yes, I am using the stock schema.xml that comes with the example server of
 Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and
 put into the content field in the index.

 Please find the log information for the Parsing error below.


 org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from
 org.apache.tika.parser.microsoft.OfficeParser@138b0c5
 at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
 at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
 Caused by: org.apache.tika.exception.TikaException: Unexpected
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
 ... 32 more
 Caused by: java.lang.IllegalArgumentException: This paragraph is not the
 first one in the table
 at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
 at
 org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
 at
 org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
 at
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
 at
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
 at
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 

Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread main
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = 
http://localhost:8983/solr;;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq = new

ContentStreamUpdateRequest(/update/extract);

indexingReq.addFile(file, fileType);
indexingReq.setParam(literal.id, literalId);
indexingReq.setParam(uprefix, attr_);
indexingReq.setParam(fmap.content, content);
indexingReq.setParam(literal.fileurl, fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks  Regards
Vijay

-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.


Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi Vijay,
Please paste an extract of your schema, where the content field (the 
field where the PDF text shoudl be) and its type are declared.

For the other issue, please paste the whole stacktrace because

org.apache.tika.parser.microsoft.OfficeParser*

says nothing. The complete stacktrace (or at least another three / four 
lines) should contain some other detail.


Best,
Andrea

On 04/14/2015 04:57 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread main
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

 static String solrServerURL = 
http://localhost:8983/solr;;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
 static ContentStreamUpdateRequest indexingReq = new

 ContentStreamUpdateRequest(/update/extract);

 indexingReq.addFile(file, fileType);
indexingReq.setParam(literal.id, literalId);
indexingReq.setParam(uprefix, attr_);
indexingReq.setParam(fmap.content, content);
indexingReq.setParam(literal.fileurl, fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks  Regards
Vijay





Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini

Hi,
solrconfig.xml (especially if you didn't touch it) should be good. What 
about the schema? Are you using the one that comes with the download 
bundle, too?


I don't see the stacktrace..did you forget to paste it?

Best,
Andrea

On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

Here are the solr-config xml and the error log from Solr logs for your 
reference. As mentioned earlier, I didnt make any changes to the 
solr-config.xml as I am using the xml file out of the box one that 
came with the default installation.


Please let me know your thoughts on why these issues are occuring.

Thanks  Regards
Vijay




*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW

*T:+44 20 3475 7980*
*M:**+44 7481 298 360*
*W: *ww http://www.whishworks.com/w.whishworks.com 
http://www.whishworks.com/


https://www.linkedin.com/company/whishworkshttp://www.whishworks.com/blog/https://twitter.com/WHISHWORKShttps://www.facebook.com/whishworksit


On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com 
mailto:vijaya.bhoomire...@whishworks.com wrote:


Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx,
.ppt, .pptx, .xlx, and .xlx) files into Solr. I am facing the
following issues. Request to please let me know what is going
wrong with the indexing process.

I am using solr 4.10.2 and using the default example server
configuration that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using
*.* in the Solr Query console, metadata information is displayed
properly. However, the PDF content field is empty. This is
happening for all PDF files I have tried. I have tried with some
proprietary files, PDF eBooks etc. Whatever be the PDF file,
content is not being displayed.

MS Office files -  For some office files, everything works perfect
and the extracted content is visible in the query console.
However, for others, I see the below error message during the
indexing process.

*Exception in thread main
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser*
*
*

I am using SolrJ to index the documents and below is the code
snippet related to indexing. Please let me know where the issue is
occurring.

static String solrServerURL =
http://localhost:8983/solr;;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest
indexingReq = new ContentStreamUpdateRequest(/update/extract);

indexingReq.addFile(file, fileType);
indexingReq.setParam(literal.id http://literal.id, literalId);
indexingReq.setParam(uprefix, attr_);
indexingReq.setParam(fmap.content, content);
indexingReq.setParam(literal.fileurl, fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true,
true);
solrServer.request(indexingReq);

Thanks  Regards
Vijay




The contents of this e-mail are confidential and for the exclusive use 
of the intended recipient. If you receive this e-mail in error please 
delete it from your system immediately and notify us either by e-mail 
or telephone. You should not copy, forward or otherwise disclose the 
content of the e-mail. The views expressed in this communication may 
not necessarily be the view held by WHISHWORKS. 




Re: Indexing PDF and MS Office files

2015-04-14 Thread Jack Krupansky
Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:

 Hi,

 I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
 Request to please let me know what is going wrong with the indexing
 process.

 I am using solr 4.10.2 and using the default example server configuration
 that comes with Solr distribution.

 PDF Files - Indexing as such works fine, but when I query using *.* in the
 Solr Query console, metadata information is displayed properly. However,
 the PDF content field is empty. This is happening for all PDF files I have
 tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
 be the PDF file, content is not being displayed.

 MS Office files -  For some office files, everything works perfect and the
 extracted content is visible in the query console. However, for others, I
 see the below error message during the indexing process.

 *Exception in thread main
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from
 org.apache.tika.parser.microsoft.OfficeParser*


 I am using SolrJ to index the documents and below is the code snippet
 related to indexing. Please let me know where the issue is occurring.

 static String solrServerURL = 
 http://localhost:8983/solr;;
 static SolrServer solrServer = new HttpSolrServer(solrServerURL);
 static ContentStreamUpdateRequest indexingReq = new

 ContentStreamUpdateRequest(/update/extract);

 indexingReq.addFile(file, fileType);
 indexingReq.setParam(literal.id, literalId);
 indexingReq.setParam(uprefix, attr_);
 indexingReq.setParam(fmap.content, content);
 indexingReq.setParam(literal.fileurl, fileURL);
 indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
 solrServer.request(indexingReq);

 Thanks  Regards
 Vijay

 --
 The contents of this e-mail are confidential and for the exclusive use of
 the intended recipient. If you receive this e-mail in error please delete
 it from your system immediately and notify us either by e-mail or
 telephone. You should not copy, forward or otherwise disclose the content
 of the e-mail. The views expressed in this communication may not
 necessarily be the view held by WHISHWORKS.



Re: Indexing PDF and MS Office files

2015-04-14 Thread Shyam R
Vijay,

You could try different excel files with different formats to rule out the
issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote:

 Perhaps the PDF is protected and the content can not be extracted?

 i have an unverified suspicion that the tika shipped with solr 4.10.2 may
 not support some/all office 2013 document formats.





 On 4/14/2015 8:18 PM, Jack Krupansky wrote:

 Try doing a manual extraction request directly to Solr (not via SolrJ) and
 use the extractOnly option to see if the content is actually extracted.

 See:
 https://cwiki.apache.org/confluence/display/solr/
 Uploading+Data+with+Solr+Cell+using+Apache+Tika

 Also, some PDF files actually have the content as a bitmap image, so no
 text is extracted.


 -- Jack Krupansky

 On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
 vijaya.bhoomire...@whishworks.com wrote:

  Hi,

 I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
 Request to please let me know what is going wrong with the indexing
 process.

 I am using solr 4.10.2 and using the default example server configuration
 that comes with Solr distribution.

 PDF Files - Indexing as such works fine, but when I query using *.* in
 the
 Solr Query console, metadata information is displayed properly. However,
 the PDF content field is empty. This is happening for all PDF files I
 have
 tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
 be the PDF file, content is not being displayed.

 MS Office files -  For some office files, everything works perfect and
 the
 extracted content is visible in the query console. However, for others, I
 see the below error message during the indexing process.

 *Exception in thread main
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException
 from
 org.apache.tika.parser.microsoft.OfficeParser*


 I am using SolrJ to index the documents and below is the code snippet
 related to indexing. Please let me know where the issue is occurring.

  static String solrServerURL = 
 http://localhost:8983/solr;;
 static SolrServer solrServer = new HttpSolrServer(solrServerURL);
  static ContentStreamUpdateRequest indexingReq =
 new

  ContentStreamUpdateRequest(/update/extract);

  indexingReq.addFile(file, fileType);
 indexingReq.setParam(literal.id, literalId);
 indexingReq.setParam(uprefix, attr_);
 indexingReq.setParam(fmap.content, content);
 indexingReq.setParam(literal.fileurl, fileURL);
 indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
 solrServer.request(indexingReq);

 Thanks  Regards
 Vijay

 --
 The contents of this e-mail are confidential and for the exclusive use of
 the intended recipient. If you receive this e-mail in error please delete
 it from your system immediately and notify us either by e-mail or
 telephone. You should not copy, forward or otherwise disclose the content
 of the e-mail. The views expressed in this communication may not
 necessarily be the view held by WHISHWORKS.





-- 
Ph: 9845704792


Re: Indexing PDF and MS Office files

2015-04-14 Thread Terry Rhodes

Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2 
may not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:


Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in the
Solr Query console, metadata information is displayed properly. However,
the PDF content field is empty. This is happening for all PDF files I have
tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect and the
extracted content is visible in the query console. However, for others, I
see the below error message during the indexing process.

*Exception in thread main
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

 static String solrServerURL = 
http://localhost:8983/solr;;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
 static ContentStreamUpdateRequest indexingReq = new

 ContentStreamUpdateRequest(/update/extract);

 indexingReq.addFile(file, fileType);
indexingReq.setParam(literal.id, literalId);
indexingReq.setParam(uprefix, attr_);
indexingReq.setParam(fmap.content, content);
indexingReq.setParam(literal.fileurl, fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks  Regards
Vijay

--
The contents of this e-mail are confidential and for the exclusive use of
the intended recipient. If you receive this e-mail in error please delete
it from your system immediately and notify us either by e-mail or
telephone. You should not copy, forward or otherwise disclose the content
of the e-mail. The views expressed in this communication may not
necessarily be the view held by WHISHWORKS.





Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-14 Thread vignesh
Dear Team,

 

I am Vignesh  using the latest version 4.8.0 Apache Solr and am
Indexing my PDF but getting an error and have posted that below for your
reference. Kindly guide me to solve this error.

 

D:\IPCB\solrjava -Durl=http://localhost:8082/solr/ipcb/update/extract
-Dparams=

literal.id=herald060214_001 -Dtype=application/pdf -jar post.jar
D:/IPCB/ipcbpd

f/herald060214_001.pdf

SimplePostTool version 1.5

Posting files to base url
http://localhost:8082/solr/ipcb/update/extract?literal

.id=herald060214_001 using content-type application/pdf..

POSTing file herald060214_001.pdf

SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error

SimplePostTool: WARNING: IOException while reading response:
java.io.IOException

: Server returned HTTP response code: 500 for URL:
http://localhost:8082/solr/ip

cb/update/extract?literal.id=herald060214_001

1 files indexed.

COMMITting Solr index changes to
http://localhost:8082/solr/ipcb/update/extract?

literal.id=herald060214_001..

SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
for u

rl
http://localhost:8082/solr/ipcb/update/extract?literal.id=herald060214_001c
o

mmit=true

Time spent: 0:00:00.062

 

 

 

Thanks  Regards.

Vignesh.V

 

cid:image001.jpg@01CA4872.39B33D40

Ninestars Information Technologies Limited.,

72, Greams Road, Thousand Lights, Chennai - 600 006. India.

Landline : +91 44 2829 4226 / 36 / 56   X: 144

 blocked::http://www.ninestars.in/ www.ninestars.in 

 


--
STOP Virus, STOP SPAM, SAVE Bandwidth!
http://www.safentrix.com/adlink?cid=0
--

Re: Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-12 Thread Siegfried Goeschl
Hi Vignesh,

can you check your SOLR Server Log?! Not all PDF documents on this planet can 
be processed using Tikka :-)

Cheers,

Siegfried Goeschl

On 07 May 2014, at 09:40, vignesh vignes...@ninestars.in wrote:

 Dear Team,
  
 I am Vignesh  using the latest version 4.8.0 Apache Solr and am 
 Indexing my PDF but getting an error and have posted that below for your 
 reference. Kindly guide me to solve this error.
  
 D:\IPCB\solrjava -Durl=http://localhost:8082/solr/ipcb/update/extract 
 -Dparams=
 literal.id=herald060214_001 -Dtype=application/pdf -jar post.jar 
 D:/IPCB/ipcbpd
 f/herald060214_001.pdf
 SimplePostTool version 1.5
 Posting files to base url 
 http://localhost:8082/solr/ipcb/update/extract?literal
 .id=herald060214_001 using content-type application/pdf..
 POSTing file herald060214_001.pdf
 SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
 SimplePostTool: WARNING: IOException while reading response: 
 java.io.IOException
 : Server returned HTTP response code: 500 for URL: 
 http://localhost:8082/solr/ip
 cb/update/extract?literal.id=herald060214_001
 1 files indexed.
 COMMITting Solr index changes to 
 http://localhost:8082/solr/ipcb/update/extract?
 literal.id=herald060214_001..
 SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error 
 for u
 rl 
 http://localhost:8082/solr/ipcb/update/extract?literal.id=herald060214_001co
 mmit=true
 Time spent: 0:00:00.062
  
  
  
 Thanks  Regards.
 Vignesh.V
  
 
 Ninestars Information Technologies Limited.,
 72, Greams Road, Thousand Lights, Chennai - 600 006. India.
 Landline : +91 44 2829 4226 / 36 / 56   X: 144
 www.ninestars.in
  
 
 STOP Virus, STOP SPAM, SAVE Bandwidth! 
 www.safentrix.com
 



Re: Indexing pdf files - question.

2013-09-08 Thread Nutan Shinde
Error got resolved,solution was dynamic field / must be within fields
tag.


On Sun, Sep 8, 2013 at 3:31 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 Could you show us logs you get when you start your web container?


 2013/9/4 Nutan Shinde nutanshinde1...@gmail.com

  My solrconfig.xml is:
 
 
 
  requestHandler name=/update/extract
  class=solr.extraction.ExtractingRequestHandler 
 
  lst name=defaults
 
  str name=fmap.contentdesc/str   !-to map this field of my table
  which
  is defined as shown below in schem.xml--
 
  str name=lowernamestrue/str
 
  str name=uprefixattr_/str
 
  str name=captureAttrtrue/str
 
  /lst
 
  /requestHandler
 
  lib dir=../../extract regex=.*\.jar /
 
 
 
  Schema.xml:
 
  fields
 
  field name=doc_id type=integer indexed=true stored=true
  multiValued=false/
 
  field name=name type=text indexed=true stored=true
  multiValued=false/
 
  field name=path type=text indexed=true stored=true
  multiValued=false/
 
  field name=desc type=text_split indexed=true stored=true
  multiValued=false/
 
  /fields
 
  types
 
  fieldType name=string class=solr.StrField  /
 
  fieldType name=integer class=solr.IntField /
 
  fieldType name=text class=solr.TextField /
 
  fieldType name=text class=solr.TextField /
 
  /types
 
  dynamicField name=*_i  type=integer  indexed=true  stored=true/
 
  uniqueKeydoc_id/uniqueKey
 
 
 
  I have created extract directory and copied all required .jar and
 solr-cell
  jar files into this extract directory and given its path in lib tag in
  solrconfig.xml
 
 
 
  When I try out this:
 
 
 
  curl
  http://localhost:8080/solr/update/extract?literal.doc_id=1commit=true;
 
  -F myfile=@solr-word.pdf mailto:myfile=@solr-word.pdf   in Windows 7.
 
 
 
  I get /solr/update/extract is not available and sometimes I get access
  denied error.
 
  I tried resolving through net,but in vain.as all the solutions are
 related
  to linux os,im working on Windows.
 
  Please help me and provide solutions related o Windows os.
 
  I referred Apache_solr_4_Cookbook.
 
  Thanks a lot.
 
 



Re: Indexing pdf files - question.

2013-09-07 Thread Furkan KAMACI
Could you show us logs you get when you start your web container?


2013/9/4 Nutan Shinde nutanshinde1...@gmail.com

 My solrconfig.xml is:



 requestHandler name=/update/extract
 class=solr.extraction.ExtractingRequestHandler 

 lst name=defaults

 str name=fmap.contentdesc/str   !-to map this field of my table
 which
 is defined as shown below in schem.xml--

 str name=lowernamestrue/str

 str name=uprefixattr_/str

 str name=captureAttrtrue/str

 /lst

 /requestHandler

 lib dir=../../extract regex=.*\.jar /



 Schema.xml:

 fields

 field name=doc_id type=integer indexed=true stored=true
 multiValued=false/

 field name=name type=text indexed=true stored=true
 multiValued=false/

 field name=path type=text indexed=true stored=true
 multiValued=false/

 field name=desc type=text_split indexed=true stored=true
 multiValued=false/

 /fields

 types

 fieldType name=string class=solr.StrField  /

 fieldType name=integer class=solr.IntField /

 fieldType name=text class=solr.TextField /

 fieldType name=text class=solr.TextField /

 /types

 dynamicField name=*_i  type=integer  indexed=true  stored=true/

 uniqueKeydoc_id/uniqueKey



 I have created extract directory and copied all required .jar and solr-cell
 jar files into this extract directory and given its path in lib tag in
 solrconfig.xml



 When I try out this:



 curl
 http://localhost:8080/solr/update/extract?literal.doc_id=1commit=true;

 -F myfile=@solr-word.pdf mailto:myfile=@solr-word.pdf   in Windows 7.



 I get /solr/update/extract is not available and sometimes I get access
 denied error.

 I tried resolving through net,but in vain.as all the solutions are related
 to linux os,im working on Windows.

 Please help me and provide solutions related o Windows os.

 I referred Apache_solr_4_Cookbook.

 Thanks a lot.




Re: Indexing pdf files - question.

2013-09-04 Thread Nutan Shinde
My solrconfig.xml is:

 

requestHandler name=/update/extract
class=solr.extraction.ExtractingRequestHandler 

lst name=defaults

str name=fmap.contentdesc/str   !-to map this field of my table which
is defined as shown below in schem.xml--

str name=lowernamestrue/str

str name=uprefixattr_/str

str name=captureAttrtrue/str

/lst

/requestHandler

lib dir=../../extract regex=.*\.jar /

 

Schema.xml:

fields 

field name=doc_id type=integer indexed=true stored=true
multiValued=false/  

field name=name type=text indexed=true stored=true
multiValued=false/  

field name=path type=text indexed=true stored=true
multiValued=false/

field name=desc type=text_split indexed=true stored=true
multiValued=false/

/fields 

types

fieldType name=string class=solr.StrField  /

fieldType name=integer class=solr.IntField /

fieldType name=text class=solr.TextField /

fieldType name=text class=solr.TextField /

/types

dynamicField name=*_i  type=integer  indexed=true  stored=true/

uniqueKeydoc_id/uniqueKey

 

I have created extract directory and copied all required .jar and solr-cell
jar files into this extract directory and given its path in lib tag in
solrconfig.xml

 

When I try out this:

 

curl
http://localhost:8080/solr/update/extract?literal.doc_id=1commit=true;

-F myfile=@solr-word.pdf mailto:myfile=@solr-word.pdf   in Windows 7.

 

I get /solr/update/extract is not available and sometimes I get access
denied error.

I tried resolving through net,but in vain.as all the solutions are related
to linux os,im working on Windows.

Please help me and provide solutions related o Windows os.

I referred Apache_solr_4_Cookbook.

Thanks a lot.



Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Can you please suggest a way (with example) of assigning this unique key to a
pdf file?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Okay. Can you please suggest a way (with an example) of assigning this unique
key to a pdf file. Say, a unique number to each pdf file. How do i achieve
this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar
We can't tell you what the id of your own document should be. Isn't
there anything which is unique about your pdf files? How about the
file name or the absolute path?

On Tue, Jul 2, 2013 at 11:33 AM, archit2112 archit2...@gmail.com wrote:
 Okay. Can you please suggest a way (with an example) of assigning this unique
 key to a pdf file. Say, a unique number to each pdf file. How do i achieve
 this?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Yes. The absolute path is unique.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074620.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Yes. The absolute path is unique. How do i implement it? can you please
explain?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar
See http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor

The implicit fields generated by the FileListEntityProcessor are
fileDir, file, fileAbsolutePath, fileSize, fileLastModified and these
are available for use within the entity

On Tue, Jul 2, 2013 at 2:47 PM, archit2112 archit2...@gmail.com wrote:
 Yes. The absolute path is unique. How do i implement it? can you please
 explain?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Hi

Im trying to index pdf files in solr 4.3.0 using the data import handler. 

*My request handler - *

requestHandler name=/dataimport1 
class=org.apache.solr.handler.dataimport.DataImportHandler 
lst name=defaults 
  str name=configdata-config1.xml/str 
/lst 
  /requestHandler 

*My data-config1.xml *

dataConfig 
dataSource type=BinFileDataSource / 
document 
entity name=f dataSource=null rootEntity=false 
processor=FileListEntityProcessor 
baseDir=C:\Users\aroraarc\Desktop\Impdo fileName=.*pdf 
recursive=true 
entity name=tika-test processor=TikaEntityProcessor 
url=${f.fileAbsolutePath} format=text 
field column=Author name=author meta=true/
field column=title name=title1 meta=true/
field column=text name=text/
/entity 
/entity 
/document 
/dataConfig 


Now When i try and index the files i get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem can be solved easily in case of database indexing but i dont
know how to go about the unique key of a document. how do i define the id
field (unique key) of a pdf file. how do i solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky

It all depends on your data model - tell us more about your data model.

For example, how will users or applications query these documents and what 
will they expect to be able to do with the ID/key for the documents?


How are you expecting to identify documents in your data model?

-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 7:17 AM
To: solr-user@lucene.apache.org
Subject: Unique key error while indexing pdf files

Hi

Im trying to index pdf files in solr 4.3.0 using the data import handler.

*My request handler - *

requestHandler name=/dataimport1
class=org.apache.solr.handler.dataimport.DataImportHandler
   lst name=defaults
 str name=configdata-config1.xml/str
   /lst
 /requestHandler

*My data-config1.xml *

dataConfig
dataSource type=BinFileDataSource /
document
entity name=f dataSource=null rootEntity=false
processor=FileListEntityProcessor
baseDir=C:\Users\aroraarc\Desktop\Impdo fileName=.*pdf
recursive=true
entity name=tika-test processor=TikaEntityProcessor
url=${f.fileAbsolutePath} format=text
field column=Author name=author meta=true/
field column=title name=title1 meta=true/
field column=text name=text/
/entity
/entity
/document
/dataConfig


Now When i try and index the files i get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem can be solved easily in case of database indexing but i dont
know how to go about the unique key of a document. how do i define the id
field (unique key) of a pdf file. how do i solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Im new to solr. Im just trying to understand and explore various features
offered by solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how i can index pdf documents using data import handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky
It's really 100% up to you how you want to come up with the unique key 
values for your documents. What would you like them to be? Just use that. 
Anything (within reason) - anything goes.


But it also comes back to your data model. You absolutely must come up with 
a data model for how you expect to index and query data in Solr before you 
just start throwing random data into Solr.


1. Design your data model.
2. Produce a Solr schema from that data model.
3. Map the raw data from your data sources (e.g., PDF files) to the fields 
in your Solr schema.


That last step includes the ID/key field, but your data model will imply any 
requirements for what the ID/key should be.


To be absolutely clear, it is 100% up to you to design the ID/key for every 
document; Solr does NOT do that for you.


Even if you are just exploring, at least come up with an exploratory 
data model - which includes what expectations you have about the unique 
ID/key for each document.


So, for that first PDF file, what expectation (according to your data model) 
do you have for what its ID/key should be?


-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Unique key error while indexing pdf files

Im new to solr. Im just trying to understand and explore various features
offered by solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how i can index pdf documents using data import handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-10 Thread Mark Wilson
Hi Michael

Thanks very much for that, it did indeed solve the problem.

I had it setup on my internal servers, as I have a separate script for
tomcat startup, but forgot all about it on the Amazon Cloud servers.

For info

I added 
CATALINA_OPTS=-Djava.awt.headless=true
export CATALINA_OPTS

to $tomcat_home/bin/setenv.sh

Thanks again

Regards Mark


On 07/06/2013 19:29, Michael Della Bitta
michael.della.bi...@appinions.com wrote:

 Hi Mark,
 
 This is a total shot in the dark, but does
 passing  -Djava.awt.headless=true when you run the server help at all?
 
 More on awt headless mode:
 http://www.oracle.com/technetwork/articles/javase/headless-136834.html
 
 Michael Della Bitta
 
 Applications Developer
 
 o: +1 646 532 3062  | c: +1 917 477 7906
 
 appinions inc.
 
 “The Science of Influence Marketing”
 
 18 East 41st Street
 
 New York, NY 10017
 
 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 w: appinions.com http://www.appinions.com/
 
 
 On Fri, Jun 7, 2013 at 11:31 AM, Mark Wilson m...@sanger.ac.uk wrote:
 
 Hi
 
 I am having an issue with adding pdf documents to a SolrCloud index I have
 setup.
 
 I can index pdf documents fine using 4.3.0 on my local box, but I have a
 SolrCloud instance setup on the Amazon Cloud (Using 2 servers) and I get
 Error.
 
 It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage. However,
 the jar is in the directory, and referenced in the solrconfig.xml file
 
   lib dir=/www/solr/lib/contrib/extraction/lib regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-cell-\d.*\.jar /
 
   lib dir=/www/solr/lib/contrib/clustering/lib/ regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-clustering-\d.*\.jar /
 
   lib dir=/www/solr/lib/contrib/langid/lib/ regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-langid-\d.*\.jar /
 
   lib dir=/www/solr/lib/contrib/velocity/lib regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-velocity-\d.*\.jar /
 
 When I start Tomcat, I can see that the file has loaded.
 
 2705 [coreLoadExecutor-4-thread-3] INFO
 org.apache.solr.core.SolrResourceLoader  ­ Adding
 'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to classloader
 
 But when I try to add a document.
 
 java
 -Durl=
 http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-c
 ollection/update/extract -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar
 post.jar 2008.Genomics.pdf
 
 
 I get this error. I¹m running on an Ubuntu machine.
 
 Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59
 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
 
 Error log.
 
 88168 [http-bio-8080-exec-1] INFO
 org.apache.solr.update.processor.LogUpdateProcessor  ­
 [quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
 params={literal.id=pdf1} {} 0 1534
 88180 [http-bio-8080-exec-1] ERROR
 org.apache.solr.servlet.SolrDispatchFilter  ­
 null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java
 :670)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 380)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 155)
 at
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
 FilterChain.java:243)
 at
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
 ain.java:210)
 at
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
 va:222)
 at
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
 va:123)
 at
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171
 )
 at
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
 at
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
 :118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
 at
 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Proce
 ssor.java:1009)
 at
 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(Abstrac
 tProtocol.java:589)
 at
 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:
 310)
 at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
 45)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
 15)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
  

Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-10 Thread Michael Della Bitta
Glad that helped. I'm going to go buy a lottery ticket now! :)

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Mon, Jun 10, 2013 at 5:56 AM, Mark Wilson m...@sanger.ac.uk wrote:

 Hi Michael

 Thanks very much for that, it did indeed solve the problem.

 I had it setup on my internal servers, as I have a separate script for
 tomcat startup, but forgot all about it on the Amazon Cloud servers.

 For info

 I added
 CATALINA_OPTS=-Djava.awt.headless=true
 export CATALINA_OPTS

 to $tomcat_home/bin/setenv.sh

 Thanks again

 Regards Mark


 On 07/06/2013 19:29, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:

  Hi Mark,
 
  This is a total shot in the dark, but does
  passing  -Djava.awt.headless=true when you run the server help at all?
 
  More on awt headless mode:
  http://www.oracle.com/technetwork/articles/javase/headless-136834.html
 
  Michael Della Bitta
 
  Applications Developer
 
  o: +1 646 532 3062  | c: +1 917 477 7906
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  w: appinions.com http://www.appinions.com/
 
 
  On Fri, Jun 7, 2013 at 11:31 AM, Mark Wilson m...@sanger.ac.uk wrote:
 
  Hi
 
  I am having an issue with adding pdf documents to a SolrCloud index I
 have
  setup.
 
  I can index pdf documents fine using 4.3.0 on my local box, but I have a
  SolrCloud instance setup on the Amazon Cloud (Using 2 servers) and I get
  Error.
 
  It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage.
 However,
  the jar is in the directory, and referenced in the solrconfig.xml file
 
lib dir=/www/solr/lib/contrib/extraction/lib regex=.*\.jar /
lib dir=/www/solr/lib/ regex=solr-cell-\d.*\.jar /
 
lib dir=/www/solr/lib/contrib/clustering/lib/ regex=.*\.jar /
lib dir=/www/solr/lib/ regex=solr-clustering-\d.*\.jar /
 
lib dir=/www/solr/lib/contrib/langid/lib/ regex=.*\.jar /
lib dir=/www/solr/lib/ regex=solr-langid-\d.*\.jar /
 
lib dir=/www/solr/lib/contrib/velocity/lib regex=.*\.jar /
lib dir=/www/solr/lib/ regex=solr-velocity-\d.*\.jar /
 
  When I start Tomcat, I can see that the file has loaded.
 
  2705 [coreLoadExecutor-4-thread-3] INFO
  org.apache.solr.core.SolrResourceLoader  ­ Adding
  'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to
 classloader
 
  But when I try to add a document.
 
  java
  -Durl=
  http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-c
  ollection/update/extract -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar
  post.jar 2008.Genomics.pdf
 
 
  I get this error. I¹m running on an Ubuntu machine.
 
  Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11
 18:51:59
  UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
 
  Error log.
 
  88168 [http-bio-8080-exec-1] INFO
  org.apache.solr.update.processor.LogUpdateProcessor  ­
  [quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
  params={literal.id=pdf1} {} 0 1534
  88180 [http-bio-8080-exec-1] ERROR
  org.apache.solr.servlet.SolrDispatchFilter  ­
  null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
  /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so:
 libXrender.so.1:
  cannot open shared object file: No such file or directory
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java
  :670)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
  380)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
  155)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
  FilterChain.java:243)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
  ain.java:210)
  at
 
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
  va:222)
  at
 
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
  va:123)
  at
 
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171
  )
  at
 
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
  at
 
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
  at
 
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
  :118)
  at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
  at
 
 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Proce
  ssor.java:1009)
  at
 
 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(Abstrac
  tProtocol.java:589)
  

Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Mark Wilson
Hi

I am having an issue with adding pdf documents to a SolrCloud index I have
setup.

I can index pdf documents fine using 4.3.0 on my local box, but I have a
SolrCloud instance setup on the Amazon Cloud (Using 2 servers) and I get
Error.

It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage. However,
the jar is in the directory, and referenced in the solrconfig.xml file

  lib dir=/www/solr/lib/contrib/extraction/lib regex=.*\.jar /
  lib dir=/www/solr/lib/ regex=solr-cell-\d.*\.jar /

  lib dir=/www/solr/lib/contrib/clustering/lib/ regex=.*\.jar /
  lib dir=/www/solr/lib/ regex=solr-clustering-\d.*\.jar /

  lib dir=/www/solr/lib/contrib/langid/lib/ regex=.*\.jar /
  lib dir=/www/solr/lib/ regex=solr-langid-\d.*\.jar /

  lib dir=/www/solr/lib/contrib/velocity/lib regex=.*\.jar /
  lib dir=/www/solr/lib/ regex=solr-velocity-\d.*\.jar /

When I start Tomcat, I can see that the file has loaded.

2705 [coreLoadExecutor-4-thread-3] INFO
org.apache.solr.core.SolrResourceLoader  ­ Adding
'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to classloader

But when I try to add a document.

java 
-Durl=http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-c
ollection/update/extract -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar
post.jar 2008.Genomics.pdf


I get this error. I¹m running on an Ubuntu machine.

Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59
UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Error log.

88168 [http-bio-8080-exec-1] INFO
org.apache.solr.update.processor.LogUpdateProcessor  ­
[quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
params={literal.id=pdf1} {} 0 1534
88180 [http-bio-8080-exec-1] ERROR
org.apache.solr.servlet.SolrDispatchFilter  ­
null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
/usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
cannot open shared object file: No such file or directory
at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java
:670)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
380)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
155)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:243)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:210)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
va:222)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
va:123)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171
)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
:118)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Proce
ssor.java:1009)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(Abstrac
tProtocol.java:589)
at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:
310)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
45)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
15)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.UnsatisfiedLinkError:
/usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1825)
at java.lang.Runtime.load0(Runtime.java:792)
at java.lang.System.load(System.java:1059)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1846)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
at java.lang.System.loadLibrary(System.java:1084)
at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:67)
at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:47)
at java.security.AccessController.doPrivileged(Native Method)
at java.awt.Toolkit.loadLibraries(Toolkit.java:1648)
at java.awt.Toolkit.clinit(Toolkit.java:1670)
at java.awt.Color.clinit(Color.java:275)
at org.apache.pdfbox.pdmodel.PDPage.clinit(PDPage.java:72)
at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:212)
at 

Re: Solr 4.3.0 Cloud Issue indexing pdf documents

2013-06-07 Thread Michael Della Bitta
Hi Mark,

This is a total shot in the dark, but does
passing  -Djava.awt.headless=true when you run the server help at all?

More on awt headless mode:
http://www.oracle.com/technetwork/articles/javase/headless-136834.html

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Fri, Jun 7, 2013 at 11:31 AM, Mark Wilson m...@sanger.ac.uk wrote:

 Hi

 I am having an issue with adding pdf documents to a SolrCloud index I have
 setup.

 I can index pdf documents fine using 4.3.0 on my local box, but I have a
 SolrCloud instance setup on the Amazon Cloud (Using 2 servers) and I get
 Error.

 It seems that it is not loading org.apache.pdfbox.pdmodel.PDPage. However,
 the jar is in the directory, and referenced in the solrconfig.xml file

   lib dir=/www/solr/lib/contrib/extraction/lib regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-cell-\d.*\.jar /

   lib dir=/www/solr/lib/contrib/clustering/lib/ regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-clustering-\d.*\.jar /

   lib dir=/www/solr/lib/contrib/langid/lib/ regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-langid-\d.*\.jar /

   lib dir=/www/solr/lib/contrib/velocity/lib regex=.*\.jar /
   lib dir=/www/solr/lib/ regex=solr-velocity-\d.*\.jar /

 When I start Tomcat, I can see that the file has loaded.

 2705 [coreLoadExecutor-4-thread-3] INFO
 org.apache.solr.core.SolrResourceLoader  ­ Adding
 'file:/www/solr/lib/contrib/extraction/lib/pdfbox-1.7.1.jar' to classloader

 But when I try to add a document.

 java
 -Durl=
 http://ec2-blah-blaheu-west-1.compute.amazonaws.com:8080/solr/quosa2-c
 ollection/update/extract -Dparams=literal.id=pdf1 -Dtype=text/pdf -jar
 post.jar 2008.Genomics.pdf


 I get this error. I¹m running on an Ubuntu machine.

 Linux ip-10-229-125-163 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59
 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

 Error log.

 88168 [http-bio-8080-exec-1] INFO
 org.apache.solr.update.processor.LogUpdateProcessor  ­
 [quosa2-collection_shard1_replica1] webapp=/solr path=/update/extract
 params={literal.id=pdf1} {} 0 1534
 88180 [http-bio-8080-exec-1] ERROR
 org.apache.solr.servlet.SolrDispatchFilter  ­
 null:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at

 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java
 :670)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 380)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 155)
 at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
 FilterChain.java:243)
 at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
 ain.java:210)
 at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
 va:222)
 at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
 va:123)
 at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171
 )
 at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947)
 at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
 :118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
 at

 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Proce
 ssor.java:1009)
 at

 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(Abstrac
 tProtocol.java:589)
 at

 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:
 310)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
 45)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
 15)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.UnsatisfiedLinkError:
 /usr/lib/jvm/java-7-oracle/jre/lib/amd64/xawt/libmawt.so: libXrender.so.1:
 cannot open shared object file: No such file or directory
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
 at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
 at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
 at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1825)
 at java.lang.Runtime.load0(Runtime.java:792)
 at java.lang.System.load(System.java:1059)
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
 at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1939)
 at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1864)
 at 

  1   2   >