Re: Data Import Handler Rich Format Documents
: What's a GA release?

http://en.wikipedia.org/wiki/Software_release_life_cycle#General_availability

-Hoss

--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
Re: Data Import Handler Rich Format Documents
What's a GA release?

Dennis Gearon

--- On Fri, 9/24/10, Lance Norskog wrote:
> Tod wrote:
> > My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution.
Re: Data Import Handler Rich Format Documents
The TikaEntityProcessor is the class in the DIH that calls the Tika libraries. TikaEntityProcessor is not in Solr 1.4 or 1.4.1. It is in the trunk and the 3.x branch.

I have set it up from the 3.x branch. I discovered that the "DefaultParser" does not work, and you have to explicitly name the parser for the file format you want to use.

https://issues.apache.org/jira/browse/SOLR-2116

Tod wrote:
> Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build.
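If the parser really does have to be named per file format, one way to choose it is from the extension of the document URL. The Python sketch below is only an illustration: the mapping and the helper name `parser_for` are hypothetical, and the class names are Tika parser classes as commonly documented, so they should be checked against the Tika jars shipped with the Solr build in use.

```python
import os
from urllib.parse import urlparse

# Illustrative mapping only (an assumption, not taken from the thread):
# verify these class names against the Tika version bundled with your Solr.
PARSER_BY_EXTENSION = {
    ".pdf": "org.apache.tika.parser.pdf.PDFParser",
    ".html": "org.apache.tika.parser.html.HtmlParser",
    ".htm": "org.apache.tika.parser.html.HtmlParser",
    ".doc": "org.apache.tika.parser.microsoft.OfficeParser",
    ".xls": "org.apache.tika.parser.microsoft.OfficeParser",
    ".ppt": "org.apache.tika.parser.microsoft.OfficeParser",
}


def parser_for(url):
    """Return a parser class name for a document URL, or None if unknown."""
    # Only the path part matters; query strings and fragments are ignored.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    return PARSER_BY_EXTENSION.get(extension)


print(parser_for("http://www.mysite.com/docs/report.pdf"))
```

The returned name would presumably go into the `parser` attribute of the TikaEntityProcessor entity; URLs with no recognizable extension return None and need some other handling.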
Re: Data Import Handler Rich Format Documents
On 9/23/2010 6:52 AM, mehdi.es...@gmail.com wrote:
> Hi,
> I have exactly the same problem as the one you described in this thread
> http://lucene.472066.n3.nabble.com/Data-Import-Handler-Rich-Format-Documents-td905478.html
> and I would like to ask whether you found a solution.
> I started looking at Tika and DataImportHandler but have not managed to find the right syntax.
> Could you please give an example if you succeeded in finding it?
> Thanks.

Bumping this to the list...

Unfortunately I could never get DIH to work correctly. My suspicion is that I was using a stock 1.4.0 Solr but attempting to perform a task that was only available on the latest build. My customer requirements demand a pretty well vetted GA release so experimenting was not an option. I attempted an upgrade (quickly, sloppily) to 1.4.1 but no luck. I believe the next GA release might be my solution.

I tried getting around that bump by trying SolrJ ContentStreamUpdateRequest @ http://lucene.472066.n3.nabble.com/Solrj-ContentStreamUpdateRequest-Slow-td1023630.html#a1301927. After floundering for a while I decided to put that on hold. I ended up writing a Perl script that emulates the command line cURL that I referenced in the above thread. It took about 72 hours to index ~850,000 entries (if anyone is interested).

I plan on looping back to try the suggestions Hoss last made; I just haven't had the time to respond. I'm sure things will work. I just needed something quickly and don't have the seasoned experience the other developers do.

- Tod
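For anyone comparing approaches, the figures quoted above translate into a rough throughput number. This short Python calculation uses only the two values from the message (about 72 hours, about 850,000 entries); both are approximate, so the result is a ballpark.

```python
# Figures quoted in the message above (both approximate).
documents = 850_000
hours = 72

seconds = hours * 3600
docs_per_second = documents / seconds
seconds_per_doc = seconds / documents

print(f"{docs_per_second:.2f} docs/s, {seconds_per_doc:.2f} s per doc")
```

That works out to roughly 3.3 documents per second, or about 0.3 seconds per document, for the one-request-per-document cURL approach.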
Re: Data Import Handler Rich Format Documents
On 6/28/2010 8:28 AM, Alexey Serba wrote:
>> Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error:
>>
>> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource
>
> It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds.
> https://issues.apache.org/jira/browse/SOLR-1583

Thanks, that would explain things - I'm using a stock 1.4.0 download.

>> My data-config.xml looks like this:
>>
>>     url="http://www.mysite.com/${my_database.content_url}"
>>
>> I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url.
>>
>> Is there anything obviously wrong with what I've tried so far?
>
> I think you should move Tika entity into my_database entity and simplify the whole configuration
>
>     ... url="http://www.mysite.com/${my_database.content_url}"

This, I guess, would be after I checked out and built from trunk?

Thanks - Tod
Re: Data Import Handler Rich Format Documents
> Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error:
>
> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource

It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds.

https://issues.apache.org/jira/browse/SOLR-1583

> My data-config.xml looks like this:
>
>     driver="oracle.jdbc.driver.OracleDriver"
>     url="jdbc:oracle:thin:@whatever:12345:whatever"
>     user="me"
>     name="ds-db"
>     password="secret"/>
>
>     name="ds-url"/>
>
>     dataSource="ds-db"
>     query="select * from my_database where rownum <=2">
>
>     dataSource="ds-url"
>     query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
>
>     dataSource="ds-url"
>     format="text">
>     url="http://www.mysite.com/${my_database.content_url}"
>
> I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url.
>
> Is there anything obviously wrong with what I've tried so far?

I think you should move Tika entity into my_database entity and simplify the whole configuration

    ... url="http://www.mysite.com/${my_database.content_url}"
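Since the archive stripped the XML tags from both configurations, here is one possible reading of the suggestion above, with the Tika entity nested directly inside the my_database entity. It is a sketch, not the original config: the element layout, the `type="BinURLDataSource"` attribute, and the `content` field name are assumptions, while the attribute values are the ones that survive in the thread. Python's standard library builds the XML so that the nesting can be checked.

```python
import xml.etree.ElementTree as ET


def build_data_config():
    """Build a data-config.xml tree with the Tika entity nested in my_database."""
    config = ET.Element("dataConfig")
    ET.SubElement(config, "dataSource", {
        "name": "ds-db",
        "driver": "oracle.jdbc.driver.OracleDriver",
        "url": "jdbc:oracle:thin:@whatever:12345:whatever",
        "user": "me",
        "password": "secret",
    })
    # Assumed: the URL data source is the BinURLDataSource named in the error.
    ET.SubElement(config, "dataSource", {"name": "ds-url", "type": "BinURLDataSource"})

    document = ET.SubElement(config, "document")
    parent = ET.SubElement(document, "entity", {
        "name": "my_database",
        "dataSource": "ds-db",
        "query": "select * from my_database where rownum <=2",
    })
    # The Tika entity is a direct child of my_database, so no second SQL
    # entity is needed just to look up CONTENT_URL again.
    tika = ET.SubElement(parent, "entity", {
        "name": "my_database_url",
        "processor": "TikaEntityProcessor",
        "dataSource": "ds-url",
        "format": "text",
        "url": "http://www.mysite.com/${my_database.content_url}",
    })
    # "text" is the column TikaEntityProcessor emits for the extracted body;
    # "content" is a hypothetical schema field name.
    ET.SubElement(tika, "field", {"column": "text", "name": "content"})
    return config


print(ET.tostring(build_data_config(), encoding="unicode"))
```

As noted earlier in the thread, this only has a chance of working on trunk or the 3.x branch, where TikaEntityProcessor and BinURLDataSource exist.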
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?
>
> http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
> http://wiki.apache.org/solr/TikaEntityProcessor
>
> -Hoss

Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error:

    java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource

    curl -s http://test.html|curl http://localhost:9080/solr/update/extract?extractOnly=true --data-binary @- -H 'Content-type:text/html'

... works fine so presumably my Tika processor is working.

My data-config.xml looks like this:

    query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
    url="http://www.mysite.com/${my_database.content_url}"

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url.

Is there anything obviously wrong with what I've tried so far?

Thanks - Tod
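The second half of the cURL pipeline above, posting HTML to the extract handler with extractOnly=true, can also be written in Python's standard library. This is a minimal sketch: the function name is hypothetical, the host and port are taken from the command in the message, and the request is only built here, not sent, so it runs without a Solr instance.

```python
import urllib.request


def build_extract_only_request(html_bytes, solr_base="http://localhost:9080/solr"):
    """Build (but do not send) a POST to the extract handler in extractOnly mode."""
    return urllib.request.Request(
        url=solr_base + "/update/extract?extractOnly=true",
        data=html_bytes,                         # same role as --data-binary @-
        headers={"Content-Type": "text/html"},   # same role as -H 'Content-type:text/html'
        method="POST",
    )


request = build_extract_only_request(b"<html><body>Hello</body></html>")
print(request.get_method(), request.full_url)

# To send it against a running Solr instance:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```

A successful response here only shows that the ExtractingRequestHandler and its Tika libraries are working; it says nothing about whether the DIH classes such as BinURLDataSource are present in the build.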
Re: Data Import Handler Rich Format Documents
You are right. It seems TikaEntityProcessor is exactly the tool you need in this case.

Alex

On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter wrote:
> : I think you can use existing ExtractingRequestHandler to do the job,
> : i.e. add child entity to your DIH metadata
>
> why would you do this instead of using the TikaEntityProcessor as i already suggested in my earlier mail?
>
> -Hoss
Re: Data Import Handler Rich Format Documents
: I think you can use existing ExtractingRequestHandler to do the job,
: i.e. add child entity to your DIH metadata

why would you do this instead of using the TikaEntityProcessor as i already suggested in my earlier mail?

-Hoss
Re: Data Import Handler Rich Format Documents
I think you can use the existing ExtractingRequestHandler to do the job, i.e. add a child entity to your DIH metadata

    url="http://localhost:8983/solr/update/extract?extractOnly=true&wt=xml&indent=on&stream.url=${metadata.url}"
    dataSource="solr">

That's not a working example, just the basic idea. You still need to URI-escape the ${metadata.url} reference, probably using some transformer (regexp, javascript?), extract the file content from the ERH XML response using XPath, and probably do some HTML stripping.

HTH,
Alex

On Fri, Jun 18, 2010 at 4:51 PM, Tod wrote:
> I'm already indexing the Metadata and that provides a lot of value. The customer however would like that the content pointed to by the URL also be indexed for more discrete searching.
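The URI-escaping step mentioned above is easy to get wrong by hand, because the document URL is itself a query parameter value. As an illustration outside DIH, Python's `urlencode` does the escaping; the function name and the sample document URL are made up for the example, and the host and port are the ones from the message.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def extract_url(document_url, solr_base="http://localhost:8983/solr"):
    """Return an extractOnly request URL with stream.url safely escaped."""
    params = {
        "extractOnly": "true",
        "wt": "xml",
        "indent": "on",
        "stream.url": document_url,
    }
    # urlencode percent-escapes ':', '/', '?', '&' and spaces in the value,
    # so the document's own query string cannot leak into Solr's parameters.
    return solr_base + "/update/extract?" + urlencode(params)


# Hypothetical document URL containing a space and its own query string.
example = extract_url("http://www.mysite.com/docs/annual report.pdf?id=7&lang=en")
print(example)
```

Note that stream.url only works when remote streaming is enabled in solrconfig.xml, which is worth checking before trying this route.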
Re: Data Import Handler Rich Format Documents
On 6/18/2010 2:42 PM, Chris Hostetter wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?

I started this thread after reading a Lucid article suggesting a custom Transformer might be the way to go when using DIH. My initial question was whether there was an alternative.

My database contains only Metadata and a reference to the actual content (HTML, Office documents, PDF...) as a URL, not blobs as the Lucid article focused on. What I would like to do is use DIH somehow to index the Metadata but also the actual content pointed to by the URL column.

I might be able to do this with Nutch instead (which uses Tika), though I haven't thoroughly researched this yet, or I can write a job to pull all the URLs out of the database and use cURL and the Solr ExtractingRequestHandler to push everything into the index.

I just wanted to see what everybody else is doing and what my other options might be.

Thanks - Tod

Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS
Re: Data Import Handler Rich Format Documents
On Fri, Jun 18, 2010 at 2:42 PM, Chris Hostetter wrote:
> I'm confused ... You're using DIH, and some of your fields are URLs to
> documents that you want to parse with Tika?
>
> Why would you need a custom Transformer?

Yeah, I can definitely vouch that DIH can handle this without additional coding. (The Lucid article the OP linked to looks like it's defining a custom Transformer because the document is in a BLOB in the database.)

However, the DIH in Solr 1.4 doesn't have the Tika support you'd need. You would need to go with either trunk or branch_3x to make this work.

Sixten
Re: Data Import Handler Rich Format Documents
: > I don't think DIH can do that, but who knows, let's see what others say.

: Looks like the ExtractingRequestHandler uses Tika as well. I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?

I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?

Why would you need a custom Transformer?

http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
http://wiki.apache.org/solr/TikaEntityProcessor

-Hoss
Re: Data Import Handler Rich Format Documents
On 6/18/2010 11:24 AM, Otis Gospodnetic wrote:
> Tod,
>
> I don't think DIH can do that, but who knows, let's see what others say.
> Yes, Nutch uses TIKA, too.
>
> Otis

Looks like the ExtractingRequestHandler uses Tika as well. I might just use this but I'm wondering if there will be a large performance difference between using it to batch content in over rolling my own Transformer?

- Tod
Re: Data Import Handler Rich Format Documents
Tod,

I don't think DIH can do that, but who knows, let's see what others say.
Yes, Nutch uses TIKA, too.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Tod
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 10:20:34 AM
> Subject: Re: Data Import Handler Rich Format Documents
>
> You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no? Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it.
Re: Data Import Handler Rich Format Documents
On 6/18/2010 9:12 AM, Otis Gospodnetic wrote:
> Tod,
>
> You didn't mention Tika, which makes me think you are not aware of it...
> You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even just call ERH from your Transformer, though that wouldn't be the most efficient.

You're right, sorry. I have looked at Tika, which I believe is used by Nutch too - no? Implementing a transformer is fine. I guess I'm being lazy and trying to see if a method of doing this has been incorporated into the latest Solr release so I can avoid coding for it.
Re: Data Import Handler Rich Format Documents
Tod,

You didn't mention Tika, which makes me think you are not aware of it...
You could implement a custom Transformer that uses Tika to perform rich doc text extraction, just like ExtractingRequestHandler does it (see http://wiki.apache.org/solr/ExtractingRequestHandler ). Maybe you could even just call ERH from your Transformer, though that wouldn't be the most efficient.

Otis

----- Original Message -----
> From: Tod
> To: solr-user@lucene.apache.org
> Sent: Fri, June 18, 2010 8:51:02 AM
> Subject: Data Import Handler Rich Format Documents
>
> What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr.
Data Import Handler Rich Format Documents
I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document, which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.

I'm already indexing the Metadata and that provides a lot of value. The customer however would like the content pointed to by the URL to be indexed as well, for more discrete searching.

This article at Lucid:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS

describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too.

What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might be ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr.

Thanks in advance.

- Tod