Re: Opinions on ExtractingRequestHandler

2018-02-08 Thread Charlie Hull
On 08/02/2018 11:47, Frederik Van Hoyweghen wrote: Hey everyone, What are your experiences on making (in production) use of Solr's ExtractingRequestHandler? I've been reading some mixed remarks so I was wondering what your actual experiences with it are. Personally, I feel like se

Re: Opinions on ExtractingRequestHandler

2018-02-08 Thread Sreenivas.T
tion needed "ExtractingRequestHandler" should be fine in production too. Regards, Sreenivas On 8 February 2018 at 17:17, Frederik Van Hoyweghen < frederik.vanhoyweg...@chapoo.com> wrote: > Hey everyone, > > What are your experiences on making (in production) use of Solr

Opinions on ExtractingRequestHandler

2018-02-08 Thread Frederik Van Hoyweghen
Hey everyone, What are your experiences on making (in production) use of Solr's ExtractingRequestHandler? I've been reading some mixed remarks so I was wondering what your actual experiences with it are. Personally, I feel like setting up a separate service which is solely respo

ExtractingRequestHandler AND NLP

2016-11-16 Thread Kyle W. Bolin
I am trying to implement the NLP functionality within the Solr ExtractingRequestHandler and the Tika framework I am using PDF documents to index and have been successful in extracting and indexing the content but have not been successful in engaging the NLP routines. I have reached the

Re: Bypassing ExtractingRequestHandler

2016-06-13 Thread Justin Lee
Thanks everyone for the help and advice. The SolrJ example makes sense to me. The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll revisit after some time. Tim: for context, I'm ultimately trying to create an external highlighter. See https://issues.apache.org/jira/browse/SOLR

RE: Bypassing ExtractingRequestHandler

2016-06-13 Thread Allison, Timothy B.
>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should >be straightforward: http://searchhub.org/2012/02/14/indexing-with-solrj/ +1 > We tend to prefer running Tika externally as it's entirely possible > that Tika will crash or hang with certain files - and that will
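A minimal sketch of the "run Tika externally" pattern mentioned above: extraction happens in a separate process with a timeout, so a crash or hang on one file cannot take the indexer (or Solr) down with it. It is written in Python rather than SolrJ; the function name `extract_text`, the timeout value, and the example command in the comment are assumptions for illustration, not code from the thread.

```python
import subprocess


def extract_text(command, timeout_seconds=30):
    """Run an external extractor command.

    Returns the text written to stdout, or None if the process
    exits with a non-zero status or does not finish in time.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # extractor does not linger.
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# Hypothetical invocation, assuming a local tika-app jar and a PDF file:
#   extract_text(["java", "-jar", "tika-app.jar", "--text", "report.pdf"])
```

A `None` result can be logged and the file skipped; the text from a successful run would then be sent to Solr as an ordinary document.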

Re: Bypassing ExtractingRequestHandler

2016-06-12 Thread Erick Erickson
est, Erick On Fri, Jun 10, 2016 at 1:22 AM, Charlie Hull wrote: > On 10/06/2016 02:20, Justin Lee wrote: >> >> Has anybody had any experience bypassing ExtractingRequestHandler and >> simply managing Tika manually? I want to make a small modification to >> Tika >>

Bug in ExtractingRequestHandler

2016-06-10 Thread Gilbert Boyreau
Hello, I think there's a bug in the ExtractingRequestHandler (Tika parser). Some Tika exceptions are not caught, and the handler returns a 0 status, indicating no problems with that content. I took a look at the code (Solr 5.1, ExtractingDocumentLoader:221), only Ti

Re: Bypassing ExtractingRequestHandler

2016-06-10 Thread Charlie Hull
On 10/06/2016 02:20, Justin Lee wrote: Has anybody had any experience bypassing ExtractingRequestHandler and simply managing Tika manually? I want to make a small modification to Tika to get and save additional data from my PDFs, but I have been procrastinating in no small part due to the

Bypassing ExtractingRequestHandler

2016-06-09 Thread Justin Lee
Has anybody had any experience bypassing ExtractingRequestHandler and simply managing Tika manually? I want to make a small modification to Tika to get and save additional data from my PDFs, but I have been procrastinating in no small part due to the unpleasant prospect of setting up a

List of file types supported by ExtractingRequestHandler

2016-02-05 Thread Steven White
Hi everyone, Is there a published list of Tika extractors and the file types supported that comes with Solr 5.2? For example, I noticed that the ASM JAR ( http://asm.ow2.org/) is not included with Solr. I can examine the JARs under /solr/contrib/extraction/lib/ and try to come up with the list, bu

Re: Get content in response from ExtractingRequestHandler

2015-07-15 Thread trung.ht
hu, Jul 9, 2015 at 7:53 PM, trung.ht wrote: > > Hi everyone, > > > > I use solr to index and search in office file (docx, pptx, ...). To > reduce > > the size of solr index, I do not store the content of the file on solr, > > however now my customer want to preview

Re: Get content in response from ExtractingRequestHandler

2015-07-10 Thread Erick Erickson
t of the file. > > I have read the document of ExtractingRequestHandler, but it seems that to > return content in the response from solr, the only option is to > set extractOnly=true, but in that case, solr would not index the file. > > My question is: is there anyway for sol

Get content in response from ExtractingRequestHandler

2015-07-09 Thread trung.ht
Hi everyone, I use Solr to index and search office files (docx, pptx, ...). To reduce the size of the Solr index, I do not store the content of the file in Solr; however, now my customer wants to preview the content of the file. I have read the documentation of ExtractingRequestHandler, but it seems that

Solr ExtractingRequestHandler - Internal server Error

2014-10-14 Thread dev09
Hi, I am trying to index rich documents with ExtractingRequestHandler. So for configuration I have in solrconfig.xml (I put all the jar of contrib/extraction/lib in solr/lib) And - text true ignored_ true links ignored_ But when i launch curl "

Re: ExtractingRequestHandler indexing zip files

2014-09-11 Thread keeblerh
Working now - fyi - the "update/extract" from a post works extracting from a kmz(zip) but I am still having trouble from the dataimport. I'll move to another thread for that. THANKS all. -- View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestH

Re: ExtractingRequestHandler indexing zip files

2014-09-10 Thread keeblerh
"Patch has to be applied to the source code and compile again Solr.war. If you do that then it works extracting the content of documents " -- View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172p4158024.html Sent from the

Re: ExtractingRequestHandler indexing zip files

2014-09-09 Thread marotosg
hi keeblerh, Patch has to be applied to the source code and compile again Solr.war. If you do that then it works extracting the content of documents Regards, Sergio -- View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files

Re: ExtractingRequestHandler indexing zip files

2014-09-09 Thread keeblerh
nd getting patches to it are not trivial. -- View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172p4157650.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: ExtractingRequestHandler - extracted files caching?

2014-06-30 Thread Erick Erickson
fields must be stored="true". > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Tue, Jul 1, 2014 at 5:55 AM, Gili Nachum wrote: >> Hello, >&

Re: ExtractingRequestHandler - extracted files caching?

2014-06-30 Thread Alexandre Rafalovitch
"true". Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Tue, Jul 1, 2014 at 5:55 AM, Gili Nachum wrote: > Hello, > > I plan to use ExtractingRequestHandler to index binary files

ExtractingRequestHandler - extracted files caching?

2014-06-30 Thread Gili Nachum
Hello, I plan to use ExtractingRequestHandler to index binary files text plus app metadata (like literal.downloadCount and others) into a single document. I expect the app metadata to change much more often than the binary file itself. I would hate to have to extract text from the binary file

Re: ExtractingRequestHandler indexing zip files

2014-05-28 Thread marotosg
this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172p4138427.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: ExtractingRequestHandler indexing zip files

2014-05-27 Thread Siegfried Goeschl
Hi Sergio, you either do the work on the caller side (which is probably a good idea since you offload the Solr server) or extend the ExtractingRequestHandler Cheers, Siegfried Goeschl On 27 May 2014, at 10:37, marotosg wrote: > Hi, > > Thanks for your answer Alexandre. >

Re: ExtractingRequestHandler indexing zip files

2014-05-27 Thread marotosg
View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip-files-tp4138172p4138255.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: ExtractingRequestHandler indexing zip files

2014-05-26 Thread Alexandre Rafalovitch
://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Mon, May 26, 2014 at 11:21 PM, marotosg wrote: > Hi, > > I am using ExtractingRequestHandler to be able to index different type of > documents (doc,pdf,txt,html) > but when

ExtractingRequestHandler indexing zip files

2014-05-26 Thread marotosg
Hi, I am using ExtractingRequestHandler to be able to index different types of documents (doc, pdf, txt, html), but when I try to index compressed files like zip files, Solr returns the name of the file inside the field which I am using to map the content. Any idea if this is actually working? I

Solr ExtractingRequestHandler XPath

2014-04-08 Thread Lucas .
Hi, I'm trying to use ExtractingRequestHandler with the XPath parameter, but this doesn't work for me -> http://wiki.apache.org/solr/ExtractingRequestHandler#XPath With &xpath=/xhtml:html/xhtml:body/descendant:node() it seems to work, but when I try with something like

Re: Update existing documents when using ExtractingRequestHandler?

2013-10-14 Thread Jeroen Steggink
I have a document and an attachment. The document contains the meta data and the attachment the actual data. I would like to combine data of both in one Solr document. I have thought of several options: 1. Using ExtractingRequestHandler I would extract the data (extractOnly) and combine it wi

Re: Update existing documents when using ExtractingRequestHandler?

2013-10-10 Thread Jason Hellman
actual data. >> I would like to combine data of both in one Solr document. >> >> I have thought of several options: >> >> 1. Using ExtractingRequestHandler I would extract the data (extractOnly) >> and combine it with the meta data and send it to Solr. >>

Re: Update existing documents when using ExtractingRequestHandler?

2013-10-10 Thread Erick Erickson
nt and an attachment. The > document contains the meta data and the attachment the actual data. > I would like to combine data of both in one Solr document. > > I have thought of several options: > > 1. Using ExtractingRequestHandler I would extract the data (extractOnly) > and com

Update existing documents when using ExtractingRequestHandler?

2013-10-09 Thread Jeroen Steggink
Hi, In a content management system I have a document and an attachment. The document contains the meta data and the attachment the actual data. I would like to combine data of both in one Solr document. I have thought of several options: 1. Using ExtractingRequestHandler I would extract the

Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
r client, isn't there anything that could be done using the PHP Solr > client only? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077893.html > Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
Thanks for the link. Also, having gone quite far with my work using the PHP Solr client, isn't there anything that could be done using the PHP Solr client only? -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHa

Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
xed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077877.html > Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
Sorry, but did you forget to send me the example's link? -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077877.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
> > //fire the curl request here referring to the file at $data->filepath > $doc->addField ('filecontent' , //content of the pdf file); > > Also, instead of firing the raw cURL request, is there a better way? I don't > know if the current PECL SOLR Clien

Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
there a better way? I don't know if the current PECL SOLR Client 1.0.2 has the feature of indexing pdf files. -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856.html Sent from the Solr - User mailing list archive at Nabble.com.

ExtractingRequestHandler literals

2013-02-08 Thread marotosg
Hi, I am trying to index some documents using ExtractingRequestHandler and tika. Solr 3.6 I would like to add some extra data coming from a different source using literal. My schema contains these fields My url http://dzoagent001:8080/solr/document/update/extract?commit=true&stream.

Re: XPath with ExtractingRequestHandler

2013-01-19 Thread Arcadius Ahouansou
Hi Mike. I am going through this too. How did you solve this? Thanks. Arcadius. On 15 December 2011 12:49, Michael Kelleher wrote: > Yeah, I tried: > > > //xhtml:div[@class='bibliographicData']/descendant:node() > > also tried > > //xhtml:div[@class='bibliographicData'] > > Neither wo

RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-12-07 Thread Brett Melbourne
erick...@gmail.com] Sent: Tuesday, November 27, 2012 7:38 AM To: solr-user@lucene.apache.org Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler Not an issue that I know of. I expect you've got some obscure problem in your definitions, b

IOFileUploadException(Too many open files) occurs while indexing using ExtractingRequestHandler

2012-11-29 Thread Shigeki Kobayashi
Hello everyone I use ManifoldCF (File Crawler) to crawl and index file contents into Solr3.6. ManifoldCF uses ExtractingRequestHandler to extract contents from files. Somehow IOFileUploadException occurs and tells there are too many open files. Does Solr open temporary files under /var/tmp/ a

Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-27 Thread Erick Erickson
> 2009-04-16T11:32:00 > > > 2012-11-23T00:29:39.73 > > > ... > > > > > Brett. > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Sunday, November 25, 2012 9:27 PM > To: solr-user@lucene.apache.org >

RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-26 Thread Brett Melbourne
... Brett. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Sunday, November 25, 2012 9:27 PM To: solr-user@lucene.apache.org Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler Did you commit after you adde

Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-25 Thread Erick Erickson
ent from ODT (Open Office Document) files submitted to the > ExtractingRequestHandler. I can reproduce this issue against the example > schema running with jetty. > > Executing a simple index request (based on the example in the wiki): > curl " > http://localhost:8983/solr/upd

Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-23 Thread Brett Melbourne
Hi all, I am encountering a problem where Solr 3.6.1 is not able to extract the text content from ODT (Open Office Document) files submitted to the ExtractingRequestHandler. I can reproduce this issue against the example schema running with jetty. Executing a simple index request (based on

Re: ExtractingRequestHandler causes Out of Memory Error

2012-10-03 Thread Jan Høydahl
Hi, If you like, you can open a JIRA issue on this and provide as much info as possible. Someone can then look into (potential) memory optimization of this part of the code. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 28. sep.

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Shigeki Kobayashi
Hi Jan. Thank you very much for your advice. So I understand Solr needs more memory to parse the files. To parse a file of size x, it needs double memory (2x). Then how much memory allocation should be taken to heap size? 8x? 16x? Regards, Shigeki 2012/9/28 Jan Høydahl > Please try to incr

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Jan Høydahl
Please try to increase -Xmx and see how much RAM you need for it to succeed. I believe it is simply a case where this particular file needs double memory (480Mb) to parse and you have only allocated 1Gb (which is not particularly much). Perhaps the code could be optimized to avoid the Arrays.cop

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Lance Norskog
" | To: solr-user@lucene.apache.org | Sent: Thursday, September 27, 2012 2:22:06 AM | Subject: ExtractingRequestHandler causes Out of Memory Error | | Hi guys, | | | I use Manifold CF to crawl files in Windows file server and index | them to | Solr using Extracting Request Handler. | Most of the docum

ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Shigeki Kobayashi
Hi guys, I use Manifold CF to crawl files on a Windows file server and index them into Solr using the Extracting Request Handler. Most of the documents are successfully indexed, but some fail and an Out of Memory Error occurs in Solr, so I need some advice. The failed files are not so big and they ar

Re: LanguageDetection inside of ExtractingRequestHandler

2012-06-20 Thread Jan Høydahl
Hi, In my opinion, instead of hardcoding such functionality into multiple request handlers, we should go the opposite direction -> modularization, factoring out Tika extraction into its own UpdateProcessor (https://issues.apache.org/jira/browse/SOLR-1763). Then the ExtractingRequestHand

LanguageDetection inside of ExtractingRequestHandler

2012-06-19 Thread Martin Ruckli
Hi all, I just wanted to check if there is a demand for this feature. I had to implement this functionality for one of our customers and would like to contribute it. Here is the use case: We are using the ExtractingRequestHandler with the extractOnly=true flag set. With a request to this

Re: using Tika (ExtractingRequestHandler)

2012-06-05 Thread Jack Krupansky
Hoss, In your edit, I noticed that the wiki makes "SolrPlugin" a link, but to a nonexistent page, although the page "SolrPlugins" does exist. See: "it is provided as a SolrPlugin," http://wiki.apache.org/solr/ExtractingRequestHandler I also noticed a few othe

Re: using Tika (ExtractingRequestHandler)

2012-06-05 Thread Chris Hostetter
I've updated the wiki to try and fill in some of these holes... http://wiki.apache.org/solr/ExtractingRequestHandler : i'm looking at using Tika to index a bunch of documents. the wiki page seems to be a little bit out of date ("// TODO: this is out of date as of Solr 1.4 - d

Re: Tika ExtractingRequestHandler and field postprocessing

2012-05-27 Thread Jack Krupansky
solr-user@lucene.apache.org Subject: Tika ExtractingRequestHandler and field postprocessing Hi, I use Tika through the Solr ExtractingRequestHandler and I face a very common use case namely: postprocessing fields from Tika in order to normalize their values or override them with explicitly passe

Tika ExtractingRequestHandler and field postprocessing

2012-05-27 Thread Raphaël
Hi, I use Tika through the Solr ExtractingRequestHandler and I face a very common use case, namely: postprocessing fields from Tika in order to normalize their values or override them with explicitly passed "literal" values. With the exception of some vague statements about "Con

Re: using Tika (ExtractingRequestHandler)

2012-05-17 Thread Ahmet Arslan
> i'm looking at using Tika to index a > bunch of documents. the wiki page seems to be a little bit > out of date ("// TODO: this is out of date as of Solr 1.4 - > dist/apache-solr-cell-1.4.jar and all of > contrib/extraction/lib are needed") and it also looks a > little incomplete. > > is there a

using Tika (ExtractingRequestHandler)

2012-05-17 Thread Welty, Richard
i'm looking at using Tika to index a bunch of documents. the wiki page seems to be a little bit out of date ("// TODO: this is out of date as of Solr 1.4 - dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib are needed") and it also looks a little incomplete. is there an actual list

Re: ExtractingRequestHandler

2012-04-03 Thread Ravish Bhagdev
ills down (e.g. file path, database > PK, etc). Would that work for your situation? > > Best > Erick > > On Sat, Mar 31, 2012 at 3:55 PM, wrote: > > Hi, > > > > I want to index various filetypes in solr, this can easily done with > > ExtractingRequestHandler

RE: ExtractingRequestHandler

2012-04-02 Thread spring
> Solr Cell is great for proof-of-concept, but for heavy-duty > applications, > you're offloading all the processing on the Solr server, > which can be a > problem. Good point! Thank you

Re: ExtractingRequestHandler

2012-04-01 Thread Bill Bell
te: >> Hi, >> >> I want to index various filetypes in solr, this can easily done with >> ExtractingRequestHandler. But I also need the extracted content back. >> I know ext.extract.only but then nothing gets indexed, right? >> >> Can I index the document AND get the content back as with ext.extract.only? >> In a single request? >> >> Thank you >> >>

Re: ExtractingRequestHandler

2012-04-01 Thread Erick Erickson
Ahhh, OK. Sure, anything you store in Solr you can get back. The key is not Tika, but your schema.xml file, and setting 'stored="true" ' bq: So my question was if I can index the original doc via ExtractingRequestHandler in Solr AND get back the text output, in a single call

RE: ExtractingRequestHandler

2012-04-01 Thread spring
doc is NOT stored in solr. So my question was if I can index the original doc via ExtractingRequestHandler in Solr AND get back the text output, in a single call. AFAIK I can do it only in 2 calls: 1) ExtractingRequestHandler?ext.extract.only=true -> Text 2) Index the text from 1) in solr
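A minimal sketch of the two-call approach described above, assuming Python and the standard library only. It builds the extract-only request URL and the document for the second (indexing) call without sending anything. The message uses the older `ext.extract.only` name; the sketch uses `extractOnly`, the name that appears elsewhere in this archive. The function names, the `content` field name, and the base URL in the usage note are hypothetical, and `extractFormat=text` is included on the assumption that plain text rather than XHTML is wanted.

```python
import json
from urllib.parse import urlencode


def extract_only_url(solr_base):
    """Call 1: ask the handler to extract text without indexing it."""
    params = {"extractOnly": "true", "extractFormat": "text"}
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_payload(doc_id, extracted_text, text_field="content"):
    """Call 2: build a JSON document carrying the text from call 1.

    The field name is a placeholder; it has to match a field
    defined in the target schema.
    """
    return json.dumps([{"id": doc_id, text_field: extracted_text}])
```

For example, `extract_only_url("http://localhost:8983/solr")` gives the URL to POST the file to, and the returned text is then passed to `index_payload` and sent to whichever update handler the Solr version in use provides for JSON documents.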

Re: ExtractingRequestHandler

2012-04-01 Thread Erick Erickson
: > Hi, > > I want to index various filetypes in solr, this can easily done with > ExtractingRequestHandler. But I also need the extracted content back. > I know ext.extract.only but then nothing gets indexed, right? > > Can I index the document AND get the content back as with e

ExtractingRequestHandler

2012-03-31 Thread spring
Hi, I want to index various filetypes in Solr; this can easily be done with ExtractingRequestHandler. But I also need the extracted content back. I know ext.extract.only but then nothing gets indexed, right? Can I index the document AND get the content back as with ext.extract.only? In a single

Re: XPathEntityProcessor and ExtractingRequestHandler

2011-12-28 Thread Chris Hostetter
: Can I use a XPathEntityProcessor in conjunction with an : ExtractingRequestHandler? Also, the scripting language that : XPathEntityProcessor uses/supports, is that just ECMA/JavaScript? : : Or is XPathEntityProcessor only supported for use in conjuntion with the : DataImportHandler? The

Re: Mapping and Capture in ExtractingRequestHandler

2011-12-21 Thread Erick Erickson
it to construct a Solr document ? > > Thanks and Regards, > Swapna. > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Wednesday, December 21, 2011 2:28 AM > To: solr-user@lucene.apache.org > Subject: Re: Mapping and Capture in Extract

RE: Mapping and Capture in ExtractingRequestHandler

2011-12-20 Thread Swapna Vuppala
@lucene.apache.org Subject: Re: Mapping and Capture in ExtractingRequestHandler When you start getting into complex HTML extraction, you're probably better off using a SolrJ program with a forgiving HTML parser and extracting the relevant bits yourself and constructing a SolrDocument. FWIW,

Re: Mapping and Capture in ExtractingRequestHandler

2011-12-20 Thread Erick Erickson
i, > > I understand that we can specify parameters in ExtractingRequestHandler in > solrconfig.xml to capture HTML tags of a particular type and map them to > desired solr fields, like something below. > > div > mysolrfield > > The above setting will capture content in "

Mapping and Capture in ExtractingRequestHandler

2011-12-19 Thread Swapna Vuppala
Hi, I understand that we can specify parameters in ExtractingRequestHandler in solrconfig.xml to capture HTML tags of a particular type and map them to desired solr fields, like something below. div mysolrfield The above setting will capture content in "div" tags and copy to the

Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Michael Kelleher
Yeah, I tried: //xhtml:div[@class='bibliographicData']/descendant:node() also tried //xhtml:div[@class='bibliographicData'] Neither worked. The DIV I need also had an ID value, and I tried both variations on ID as well. Still nothing. XPath handling for Tika seems to be pretty basic and

Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Péter Király
Hi, maybe I am wrong, but the // should be at the beginning of the expression, like //xhtml:div[@class='bibliographicData']/descendant:node(), or if you want to search the div inside body, you have to use descendant like /xhtml:html/xhtml:body/descendant::xhtml:div[@class='bibliographicData']/desc

XPath with ExtractingRequestHandler

2011-12-14 Thread Michael Kelleher
I want to restrict the HTML that is returned by Tika to basically: /xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant:node() and it seems that the XPath class being used does not support the '//' syntax. Is there any way to configure Tika to use a different XPath e

ExtractingRequestHandler and HTML

2011-12-12 Thread Michael Kelleher
I am submitting HTML document to Solr using the ERH. Is it possible to store the contents of the document (including all markup) into a field? Using fmap.content (I am assuming this comes from Tika) stores the extracted text of the document in a field, but not the markup. I want the whole un

XPathEntityProcessor and ExtractingRequestHandler

2011-12-07 Thread Michael Kelleher
Can I use an XPathEntityProcessor in conjunction with an ExtractingRequestHandler? Also, the scripting language that XPathEntityProcessor uses/supports, is that just ECMA/JavaScript? Or is XPathEntityProcessor only supported for use in conjunction with the DataImportHandler? Thanks.

Re: ExtractingRequestHandler HTTP GET Problem

2011-11-17 Thread Chris Hostetter
: indexed file. The CommonsHttpSolrServer sends the parameters as a HTTP : GET request. Because of that I'll get a "socket write error". If I : change the CommonsHttpSolrServer to send the parameters as HTTP POST : sending will work, but the ExtractingRequestHandler will not

ExtractingRequestHandler HTTP GET Problem

2011-11-09 Thread Felix Remmel
Hi, I have a problem with the ExtractingRequestHandler of Solr. I want to send a really big base64 encoded string to Solr with the CommonsHttpSolrServer. The base64 encoded string is the content of the indexed file. The CommonsHttpSolrServer sends the parameters as an HTTP GET request. Because of

Re: form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-11-02 Thread kgoess
$extract_uri, Content_Type => 'form-data', Content => \@content, ); -- View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3474450.html Sent from the Solr - User mailing list archive at Nabble.com.

form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-10-28 Thread kgoess
I'm trying to post a PDF along with a whole bunch of metadata fields to the ExtractingRequestHandler as multipart/form-data. It works fine except for the utf-8 character handling. Here is what my post looks like (abridged): POST /solr/update/extract HTTP/1.1 TE: deflate,gzip;

Re: bug in ExtractingRequestHandler with PDFs and metadata field Category

2011-07-07 Thread Juan Grande
the "Category" field I have in the schema with the Category > metadata from PDF > This is the expected behavior, as it's described in http://wiki.apache.org/solr/ExtractingRequestHandler: uprefix= - Prefix all fields that are not defined in the schema with > the given
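A toy model of the `uprefix` rule quoted above, assuming Python. It is not Solr code and ignores other processing steps such as `fmap.*` and `lowernames`; it only illustrates why a PDF metadata field named Category ends up in a schema field of the same name while undefined fields receive the prefix. The function name and the sample data are hypothetical.

```python
def apply_uprefix(extracted_fields, schema_fields, uprefix):
    """Prefix every extracted field that the schema does not define."""
    result = {}
    for name, value in extracted_fields.items():
        # Fields known to the schema keep their name; all others are prefixed.
        target = name if name in schema_fields else uprefix + name
        result[target] = value
    return result


# Sample data, made up for illustration.
metadata = {"Category": "from the PDF", "Author": "someone"}
mapped = apply_uprefix(metadata, {"Category"}, "ignored_")
```

Here `mapped` keeps `Category` as-is because the schema defines it, and `Author` becomes `ignored_Author`.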

bug in ExtractingRequestHandler with PDFs and metadata field Category

2011-07-07 Thread Andras Balogh
Hi, I think this is a bug but before reporting to issue tracker I thought I will ask it here first. So the problem is I have a PDF file which among other metadata fields like Author, CreatedDate etc. has a metadata field Category (I can see all metadata fields with tika-app.jar started in

Re: ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
One solution to this problem is to change the order of field operation (http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations) to first do fmap.*= processing, then add the fields from literal.*=. Why would anyone want to rename a field they just have explicitly named

ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
Hi, I post a PDF from a CMS client, which has metadata about the document. One of those metadata is the title. I trust the title of the CMS more than the title extracted from the PDF, but I cannot find a way to both send &literal.title=CMS-Title as well as changing the name of the title field

Re: Can ExtractingRequestHandler ignore documents metadata

2011-05-11 Thread Grant Ingersoll
You can map the attributes to the ignore field. Alternatively, override the SolrContentHandler's newMethod() method to skip adding them. Come to think of it, I'll put up a quick patch that breaks that out a bit more and makes it easier to override. Longer term, a patch to exclude metadata wou

Can ExtractingRequestHandler ignore documents metadata

2011-05-09 Thread Tod
I'm indexing content from a CMS' database of metadata. The client would prefer that Solr exclude the properties (metadata) of any documents being indexed. Is there a way to tell Tika to only index a document's text and not its properties? Thanks - Tod

Re: ExtractingRequestHandler and Solr 3.1

2011-04-14 Thread Liam O'Boyle
een very smooth and > > painless, I'm having a minor issue with the ExtractingRequestHandler. > > > > The problem is that it's inserting metadata into the extracted > > content, as well as mapping it to a dynamic field. Previously the > > same configuration only mapped

Re: ExtractingRequestHandler and Solr 3.1

2011-04-13 Thread Grant Ingersoll
On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote: > Afternoon, > > After an upgrade to Solr 3.1 which has largely been very smooth and > painless, I'm having a minor issue with the ExtractingRequestHandler. > > The problem is that it's inserting metadata into

ExtractingRequestHandler and Solr 3.1

2011-04-13 Thread Liam O'Boyle
Afternoon, After an upgrade to Solr 3.1 which has largely been very smooth and painless, I'm having a minor issue with the ExtractingRequestHandler. The problem is that it's inserting metadata into the extracted content, as well as mapping it to a dynamic field. Previousl

Re: Tika config in ExtractingRequestHandler

2011-01-27 Thread Lance Norskog
or Solr and thought I had > to invoke some Tika functionality by this configuration file in order to > do so, but found out that I could rewrite some of the > ExtractingRequestHandler classes instead. > > Erlend > > On 27.01.11 16.12, Adam Estrada wrote: >> I believe t

Re: Tika config in ExtractingRequestHandler

2011-01-27 Thread Erlend Garåsen
hat I could rewrite some of the ExtractingRequestHandler classes instead. Erlend On 27.01.11 16.12, Adam Estrada wrote: I believe that as long as Tika is included in a folder that is referenced by solrconfig.xml you should be good. Solr will automatically throw mime types to Tika for parsing. Can a

Re: Tika config in ExtractingRequestHandler

2011-01-27 Thread Adam Estrada
ge for the ExtractingRequestHandler says that I can add the > following configuration: > /my/path/to/tika.config > > I have tried to google for an example of such a Tika config file, but > haven't found anything. > > Erlend > > -- > Erlend Garåsen > Center for Inf

Tika config in ExtractingRequestHandler

2011-01-27 Thread Erlend Garåsen
The wiki page for the ExtractingRequestHandler says that I can add the following configuration: /my/path/to/tika.config I have tried to google for an example of such a Tika config file, but haven't found anything. Erlend -- Erlend Garåsen Center for Information Technology Ser

Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-16 Thread Lance Norskog
You need to add another parameter which defines the 'id' field. 'id' is required; it must be unique for every document. Usually you can pick the filename. Lance On Fri, Jan 14, 2011 at 3:59 AM, Jörg Agatz wrote: > ok, now in the 4 test, it works ? ok.. i dont know... it works.. but now i > have a Oh
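
A minimal sketch of that advice, assuming the handler is mounted at `/update/extract` and accepts `literal.<field>` parameters; the host, the helper name, and the choice of filename as id are illustrative only.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, commit=True):
    """Return an extract-handler URL that supplies the required 'id' field."""
    params = {"literal.id": doc_id}  # the unique key, here the filename
    if commit:
        params["commit"] = "true"
    return f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"


url = build_extract_url("http://localhost:8983/solr", "report.pdf")
```

The file itself would then be posted to that URL, for example as a multipart upload from curl or an HTTP client.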

Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Jörg Agatz
No, I don't know. This is the request handler: last_modified true text true ignored_ true links ignored_ and I start it like this: curl " http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text

Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Stefan Matheis
pass a value for your id field, as you already do for all the other fields? http://search.lucidimagination.com/search/document/ca95d06e700322ed/missing_required_field_id_using_extractingrequesthandler On Fri, Jan 14, 2011 at 12:59 PM, Jörg Agatz wrote: > ok, now in the 4 test, it works ? ok..

Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Jörg Agatz
OK, now on the fourth test it works? I don't know why, but it works. Now I have another problem, though: I can't send content to the server. When I try to send content to Solr I get: Error 400 HTTP ERROR: 400 Document [null] missing required field: id RequestURI=/solr/update/extract http://jetty.mortb

Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Jörg Agatz
Hello, I want to index full-text documents, and I read that Tika is a good idea :-) so I tried the how-to from lucidimagination ( http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika ). First of all, I installed Maven2 and built Tika with mvn; I have tested Tika in the shell

Runnig ExtractingRequestHandler from /multicore/core0 (lucidworks for solr 1.4.1)

2010-12-17 Thread Wodek Siebor
http://lucene.472066.n3.nabble.com/Runnig-ExtractingRequestHandler-from-multicore-core0-lucidworks-for-solr-1-4-1-tp2105744p2105744.html Sent from the Solr - User mailing list archive at Nabble.com.

ExtractingRequestHandler configuration

2010-12-05 Thread alessandro.ri...@virgilio.it
Hi all, I added the ExtractingRequestHandler to my Solr 1.4.1 instance with the default configuration that I found on the wiki (http://wiki.apache.org/solr/ExtractingRequestHandler). last_modified ignored_ Now when I ingest via the SolrJ API the html and
