Re: Data Import Handler Rich Format Documents

2010-09-29 Thread Chris Hostetter

: What's a GA release?

http://en.wikipedia.org/wiki/Software_release_life_cycle#General_availability

-Hoss




Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Dennis Gearon
What's a GA release?

Dennis Gearon





Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Lance Norskog
The TikaEntityProcessor is the class in the DIH that calls the Tika 
libraries.
TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and 
the 3.x branch.


I have set it up from the 3.x branch. I discovered that the 
"DefaultParser" does not work, and you have to explicitly name the 
parser for the file format you want to use.


https://issues.apache.org/jira/browse/SOLR-2116



Re: Data Import Handler Rich Format Documents

2010-09-24 Thread Tod

On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:

> Hi,
> I have exactly the same problem as the one you described in this thread:
> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
> Did you find a solution? I have started looking at Tika and the
> DataImportHandler, but I have not managed to work out the right syntax.
> If you did, could you please share an example?
> Thanks.


Bumping this to the list...

Unfortunately I could never get DIH to work correctly.  My suspicion is 
that I was using a stock 1.4.0 Solr but attempting to perform a task 
that was only available on the latest build.  My customer requirements 
demand a pretty well vetted GA release so experimenting was not an 
option.  I attempted an upgrade (quickly, sloppily) to 1.4.1 but no 
luck.  I believe the next GA release might be my solution.


I tried getting around that bump by trying SolrJ 
ContentStreamUpdateRequest @ 
http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. 
 After floundering for a while I decided to put that on hold.  I ended 
up writing a Perl script that emulates the command line cURL that I 
referenced in the above thread.  It took about 72 hours to index 
~850,000 entries (if anyone is interested).


I plan on looping back to try the suggestions Hoss last made; I just 
haven't had the time to respond.  I'm sure things will work. I just 
needed something quickly and don't have the seasoned experience the 
other developers do.



- Tod


Re: Data Import Handler Rich Format Documents

2010-07-06 Thread Tod

On 6/28/2010 8:28 AM, Alexey Serba wrote:

> > Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using
> > Solr Version: 1.4.0 and getting the following error:
> >
> > java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
> > org.apache.solr.handler.dataimport.BinURLDataSource
>
> It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
> release. You should use trunk / nightly builds.
> https://issues.apache.org/jira/browse/SOLR-1583



Thanks, that would explain things - I'm using a stock 1.4.0 download.



> > My data-config.xml looks like this:
> >
> > [XML markup was stripped by the archive]
> >
> > I added the entity name="my_database_url" section to an existing (working)
> > database entity to be able to have Tika index the content pointed to by the
> > content_url.
> >
> > Is there anything obviously wrong with what I've tried so far?
>
> I think you should move the Tika entity into the my_database entity and
> simplify the whole configuration:
>
> [XML markup was stripped by the archive; only the url value survives]
>
> http://www.mysite.com/${my_database.content_url}






This, I guess, would be after I checked out and built from trunk?


Thanks - Tod


Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
> Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using
> Solr Version: 1.4.0 and getting the following error:
>
> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
> org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
release. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583

> My data-config.xml looks like this:
>
> [XML element names were stripped by the archive; the surviving attributes follow]
>
>    driver="oracle.jdbc.driver.OracleDriver"
>    url="jdbc:oracle:thin:@whatever:12345:whatever"
>    user="me"
>    name="ds-db"
>    password="secret"/>
>
>    name="ds-url"/>
>
>    dataSource="ds-db"
>    query="select * from my_database where rownum <=2">
>
>    dataSource="ds-url"
>    query="select CONTENT_URL from my_database where
>           content_id='${my_database.CONTENT_ID}'">
>    dataSource="ds-url"
>    format="text">
>    url="http://www.mysite.com/${my_database.content_url}"
>
>
> I added the entity name="my_database_url" section to an existing (working)
> database entity to be able to have Tika index the content pointed to by the
> content_url.
>
> Is there anything obviously wrong with what I've tried so far?

I think you should move the Tika entity into the my_database entity and
simplify the whole configuration:

[XML markup was stripped by the archive; only the url value survives]

http://www.mysite.com/${my_database.content_url}
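The XML for this suggestion did not survive archiving. As a rough guide only, the Python sketch below generates what a nested configuration of this kind might look like. The data source names, query, and url value are taken from the config quoted above. Everything else is an assumption: the type="BinURLDataSource" attribute, the placement of the TikaEntityProcessor attributes on the my_database_url entity, the "text" column, and the target field name "content" (hypothetical). It needs a build that actually ships TikaEntityProcessor (trunk or branch_3x, not 1.4.x).

```python
import xml.etree.ElementTree as ET


def build_nested_config():
    """Build a DIH config with the Tika entity nested inside my_database."""
    config = ET.Element("dataConfig")
    ET.SubElement(config, "dataSource", {
        "name": "ds-db",
        "driver": "oracle.jdbc.driver.OracleDriver",
        "url": "jdbc:oracle:thin:@whatever:12345:whatever",
        "user": "me",
        "password": "secret",
    })
    # Assumption: the second data source is the binary URL fetcher.
    ET.SubElement(config, "dataSource",
                  {"name": "ds-url", "type": "BinURLDataSource"})

    document = ET.SubElement(config, "document")
    parent = ET.SubElement(document, "entity", {
        "name": "my_database",
        "dataSource": "ds-db",
        "query": "select * from my_database where rownum <=2",
    })
    # Child entity: evaluated once per parent row, so the
    # ${my_database.content_url} reference resolves against that row.
    tika = ET.SubElement(parent, "entity", {
        "name": "my_database_url",
        "processor": "TikaEntityProcessor",
        "dataSource": "ds-url",
        "format": "text",
        "url": "http://www.mysite.com/${my_database.content_url}",
    })
    # Assumption: the extracted body arrives in a column named "text";
    # the Solr field name "content" is hypothetical.
    ET.SubElement(tika, "field", {"column": "text", "name": "content"})
    return config


# Print the generated XML so it can be pasted into data-config.xml.
print(ET.tostring(build_nested_config(), encoding="unicode"))
```

Nesting removes the separate "select CONTENT_URL ..." query, because the parent row already carries the URL column.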





Re: Data Import Handler Rich Format Documents

2010-06-22 Thread Tod

On 6/18/2010 2:42 PM, Chris Hostetter wrote:

> : > I don't think DIH can do that, but who knows, let's see what others say.
>
> : Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
> : this but I'm wondering if there will be a large performance difference between
> : using it to batch content in over rolling my own Transformer?
>
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?
>
> http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
> http://wiki.apache.org/solr/TikaEntityProcessor
>
> -Hoss


Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm 
using Solr Version: 1.4.0 and getting the following error:


java.lang.ClassNotFoundException: Unable to load BinURLDataSource or 
org.apache.solr.handler.dataimport.BinURLDataSource


curl -s http://test.html | curl 'http://localhost:9080/solr/update/extract?extractOnly=true' \
  --data-binary @- -H 'Content-type:text/html'


... works fine so presumably my Tika processor is working.


My data-config.xml looks like this:

[XML markup was stripped by the archive; only these attribute fragments survive]

 query="select CONTENT_URL from my_database where
        content_id='${my_database.CONTENT_ID}'">
 url="http://www.mysite.com/${my_database.content_url}"



I added the entity name="my_database_url" section to an existing 
(working) database entity to be able to have Tika index the content 
pointed to by the content_url.


Is there anything obviously wrong with what I've tried so far?


Thanks - Tod


Re: Data Import Handler Rich Format Documents

2010-06-21 Thread Alexey Serba
You are right. It seems TikaEntityProcessor is exactly the tool you
need in this case.

Alex

On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter
 wrote:
> : I think you can use existing ExtractingRequestHandler to do the job,
> : i.e. add child entity to your DIH metadata
>
> why would you do this instead of using the TikaEntityProcessor as i
> already suggested in my earlier mail?
>
>
>
> -Hoss
>
>


Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter
: I think you can use existing ExtractingRequestHandler to do the job,
: i.e. add child entity to your DIH metadata

why would you do this instead of using the TikaEntityProcessor as i 
already suggested in my earlier mail?



-Hoss



Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Alexey Serba
I think you can use the existing ExtractingRequestHandler to do the job,
i.e. add a child entity to your DIH metadata:

[XML markup was stripped by the archive; only these attribute fragments survive]

url="http://localhost:8983/solr/update/extract?extractOnly=true&wt=xml&indent=on&stream.url=${metadata.url}"
dataSource="solr">

That's not a working example, just the basic idea. You would still need to
URI-escape the ${metadata.url} reference, probably using some transformer
(regexp, javascript?), extract the file content from the ERH XML response
using XPath, and probably do some HTML stripping.
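
The two remaining steps (escaping the document URL before it goes into stream.url, then pulling the extracted text out of the XML response) could look roughly like this in Python. This is only a sketch: the function names are made up, the sample response is an assumed shape rather than captured Solr output, and the regex tag stripping is deliberately crude.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint from the example above; adjust host and port for your install.
EXTRACT_ENDPOINT = "http://localhost:8983/solr/update/extract"


def build_extract_url(document_url):
    """Return an extract-only request URL with document_url safely escaped."""
    params = {
        "extractOnly": "true",
        "wt": "xml",
        "indent": "on",
        "stream.url": document_url,  # urlencode percent-escapes this value
    }
    return EXTRACT_ENDPOINT + "?" + urlencode(params)


def extract_text(response_xml):
    """Take the first top-level <str> element and strip HTML tags from it."""
    root = ET.fromstring(response_xml)
    node = root.find("./str")  # assumption: the body is a top-level <str>
    if node is None or node.text is None:
        return ""
    without_tags = re.sub(r"<[^>]+>", " ", node.text)
    return " ".join(without_tags.split())


# Hypothetical response, for illustration only.
SAMPLE_RESPONSE = (
    '<response><lst name="responseHeader"/>'
    '<str name="doc">&lt;html&gt;&lt;body&gt;&lt;p&gt;Hello '
    'world&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</str></response>'
)
```

Inside DIH itself the same escaping would have to happen in a transformer before the url attribute is resolved.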

HTH,
Alex



Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod

On 6/18/2010 2:42 PM, Chris Hostetter wrote:

> : > I don't think DIH can do that, but who knows, let's see what others say.
>
> : Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
> : this but I'm wondering if there will be a large performance difference between
> : using it to batch content in over rolling my own Transformer?
>
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?


I started this thread after reading a Lucid article suggesting a custom 
Transformer might be the way to go when using DIH.  My initial question 
was if there was an alternative.


My database contains only Metadata and a reference to the actual content 
(HTML,Office Documents, PDF...) as a URL - not blobs as the Lucid 
article focused on.  What I would like to do is use DIH somehow to index 
the Metadata but also the actual content pointed to by the URL column.


I might be able to do this instead with Nutch (which uses Tika), though I 
haven't thoroughly researched this yet. Or I could write a job to pull all 
the URLs out of the database and use cURL and the Solr 
ExtractingRequestHandler to push everything into the index.  I just 
wanted to see what everybody else is doing and what my other options 
might be.
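
For the "write a job" option, a minimal Python sketch might look like the following. It assumes the rows have already been read from the database as (id, url) pairs, that Solr listens on localhost:8983, and that the unique key field is called "id". The function names are hypothetical, and error handling and the final commit are left out.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed Solr location; adjust host and port for your install.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def build_extract_request(doc_id, content, content_type):
    """Build (but do not send) a POST that hands raw document bytes to the ERH."""
    # literal.id attaches the database key to the extracted document.
    query = urlencode({"literal.id": doc_id})
    return Request(
        SOLR_EXTRACT + "?" + query,
        data=content,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_rows(rows, fetch=urlopen, send=urlopen):
    """For each (id, url) row: download the document, then post it to Solr."""
    count = 0
    for doc_id, url in rows:
        with fetch(url) as resp:
            body = resp.read()
            ctype = resp.headers.get("Content-Type", "application/octet-stream")
        with send(build_extract_request(doc_id, body, ctype)):
            count += 1
    return count
```

A call such as index_rows([("42", "http://www.mysite.com/doc.pdf")]) would fetch and post one document. A commit is still needed afterwards before the documents become searchable.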



Thanks - Tod


Ref:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS


Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Sixten Otto
On Fri, Jun 18, 2010 at 2:42 PM, Chris Hostetter
 wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?

Yeah, I can definitely vouch that DIH can handle this without
additional coding. (The Lucid article the OP linked to looks like it's
defining a custom Transformer because the document is in a BLOB in the
database.)

However, the DIH in Solr 1.4 doesn't have the Tika support you'd need.
You would need to go with either trunk or branch_3x to make this work.

Sixten


Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Chris Hostetter

: > I don't think DIH can do that, but who knows, let's see what others say.

: Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?

I'm confused ... You're using DIH, and some of your fields are URLs to 
documents that you want to parse with Tika?

Why would you need a custom Transformer?

http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
http://wiki.apache.org/solr/TikaEntityProcessor

-Hoss



Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod

On 6/18/2010 11:24 AM, Otis Gospodnetic wrote:

> Tod,
>
> I don't think DIH can do that, but who knows, let's see what others say.
> Yes, Nutch uses TIKA, too.
>
>  Otis


Looks like the ExtractingRequestHandler uses Tika as well.  I might just 
use this but I'm wondering if there will be a large performance 
difference between using it to batch content in over rolling my own 
Transformer?



- Tod



Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
Tod,

I don't think DIH can do that, but who knows, let's see what others say.
Yes, Nutch uses TIKA, too.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Tod

On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:

> Tod,
>
> You didn't mention Tika, which makes me think you are not aware of it...
> You could implement a custom Transformer that uses Tika to perform rich doc
> text extraction, just like ExtractingRequestHandler does it (see
> http://wiki.apache.org/solr/ExtractingRequestHandler ).  Maybe you could even
> just call ERH from your Transformer, though that wouldn't be the most efficient.



You're right, sorry.  I have looked at Tika, which I believe is used by 
Nutch too - no?


Implementing a transformer is fine.  I guess I'm being lazy and trying 
to see if a method of doing this has been incorporated into the latest 
Solr release so I can avoid coding for it.













Re: Data Import Handler Rich Format Documents

2010-06-18 Thread Otis Gospodnetic
Tod,

You didn't mention Tika, which makes me think you are not aware of it...
You could implement a custom Transformer that uses Tika to perform rich doc 
text extraction, just like ExtractingRequestHandler does it (see 
http://wiki.apache.org/solr/ExtractingRequestHandler ).  Maybe you could even 
just call ERH from your Transformer, though that wouldn't be the most efficient.

 Otis



Data Import Handler Rich Format Documents

2010-06-18 Thread Tod
I have a database containing Metadata from a content management system. 
 Part of that data includes a URL pointing to the actual published 
document which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.


I'm already indexing the Metadata and that provides a lot of value.  The 
customer, however, would like the content pointed to by the URL to be 
indexed as well for more discrete searching.


This article at Lucid:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS

describes the process of coding a custom transformer.  A separate 
article I've read implies Nutch could be used to provide this 
functionality too.


What would be the best and most efficient way to accomplish what I'm 
trying to do?  I have a feeling the Lucid article might be dated and 
there might be ways to do this now without any coding and maybe without 
even needing to use Nutch.  I'm using the current release version of Solr.


Thanks in advance.


- Tod