[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163485#comment-14163485 ]

Simon Endele commented on SOLR-1763:

I'd appreciate this feature, because it would then also be possible to post-process the output of Tika.

Integrate Solr Cell/Tika as an UpdateRequestProcessor

Key: SOLR-1763
URL: https://issues.apache.org/jira/browse/SOLR-1763
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Labels: extracting_request_handler, solr_cell, tika, update_request_handler

From Chris Hostetter's original post in solr-dev:

As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath-matching hunks of text back into the document as new fields.

Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams and adds them as binary data fields and adds the other literal params as fields.

Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML and CSV based updates, fairly trivial?

-Hoss

I couldn't agree more, so I decided to add it as an issue.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566509#comment-13566509 ]

Jan Høydahl commented on SOLR-1763:

Anyone interested in this feature?

Integrate Solr Cell/Tika as an UpdateRequestProcessor

Key: SOLR-1763
URL: https://issues.apache.org/jira/browse/SOLR-1763
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Labels: extracting_request_handler, solr_cell, tika, update_request_handler
[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402149#comment-13402149 ]

Jan Høydahl commented on SOLR-1763:

I won't have time to look at this before October-ish, so anyone should feel free to give it a shot :)
[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904979#action_12904979 ]

Jan Høydahl commented on SOLR-1763:

Ideally the UpdateProcessor will do everything that the RequestHandler does and more. We might still need a RequestHandler which is capable of accepting a binary file as input, as well as conveying certain request parameters to the UpdateProcessor. But that should probably be a new thinner RawUpdateRequestHandler.

When this more generic architecture has proven itself superior, then we can start deprecating old stuff. DIH should then also start looking to the UpdateProcessor for its Tika needs.
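To make the "thinner RawUpdateRequestHandler" idea concrete, here is a minimal sketch of what such a handler might do: store the raw upload in a binary field and copy selected request parameters onto the document, leaving all parsing to a later processor. Everything in it is hypothetical and nothing here is a Solr class: the name `RawHandlerSketch`, the `toDocument` method, and the field name passed in are made up for illustration. The `literal.` prefix is borrowed from the way ExtractingRequestHandler names its literal params; whether a new handler would reuse it is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: not a Solr class and not the proposed implementation.
final class RawHandlerSketch {

    // Assumed convention: params named "literal.<field>" become document fields.
    static final String LITERAL_PREFIX = "literal.";

    /**
     * Builds a plain map standing in for a document: the raw body goes into
     * binaryField untouched, and each "literal.*" param becomes a field.
     * No extraction happens here; that is left to a later processor.
     */
    static Map<String, Object> toDocument(byte[] body,
                                          Map<String, String> params,
                                          String binaryField) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(binaryField, body);
        for (Map.Entry<String, String> p : params.entrySet()) {
            String name = p.getKey();
            if (name.startsWith(LITERAL_PREFIX)) {
                // Strip the prefix so "literal.id" becomes the field "id".
                doc.put(name.substring(LITERAL_PREFIX.length()), p.getValue());
            }
            // Other params (e.g. commit flags) are not copied onto the document.
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1");
        params.put("commit", "true");
        Map<String, Object> doc = toDocument(new byte[] {1, 2, 3}, params, "rawbin");
        System.out.println(doc.keySet());
    }
}
```

The point of the split is that the handler stays ignorant of Tika, so the same extraction step could serve documents arriving by any other route.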
[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899548#action_12899548 ]

Jan Høydahl commented on SOLR-1763:

Starting to look into this one. Would it make the most sense to make the patch against contrib/extraction, since it depends on the Tika jars?
[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843217#action_12843217 ]

Jan Høydahl commented on SOLR-1763:

I may have a need for this functionality in an upcoming project. Can anyone who knows the code estimate the effort?
[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor
[ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831108#action_12831108 ]

Jan Høydahl commented on SOLR-1763:

Re-posting my comment from solr-dev in this ticket:

Good match. UpdateProcessors are the way to go for functionality that modifies documents prior to indexing. With this, we can mix and match any type of content source with other processing needs.

I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you don't always have that choice: if your source is a crawler without built-in Tika, some base64-encoded field in an XML document, or some other random source, you want to do the extraction at an arbitrary place in the chain.

Examples:

Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index

I propose to model the document processor chain more after FAST ESP's flexible processing chain, which must be seen as an industry best practice. I'm thinking of starting a Wiki page to model what direction we should go.

--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com
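The "XML (title, pdfurl) -> GetUrlProcessor -> TikaUpdateProcessor -> index" example from the re-posted comment can be sketched roughly as follows. This is an illustration of the chaining idea only, under loud assumptions: the `Stage` interface and both stage classes are hypothetical stand-ins, not Solr's UpdateRequestProcessor API. The fetch stage does no HTTP and the extract stage does not call Tika; both just fabricate placeholder values so the flow of fields is visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are Solr or Tika classes.
final class ChainSketch {

    /** One stage in the chain: reads the document and may add fields to it. */
    interface Stage {
        void process(Map<String, Object> doc);
    }

    /** Stand-in for "GetUrlProcessor"; a real one would download the URL. */
    static final class FakeGetUrlStage implements Stage {
        @Override
        public void process(Map<String, Object> doc) {
            Object url = doc.get("pdfurl");
            if (url != null) {
                // Placeholder bytes instead of a real download.
                doc.put("pdfbin", ("bytes-of:" + url).getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    /** Stand-in for "TikaUpdateProcessor"; a real one would parse with Tika. */
    static final class FakeExtractStage implements Stage {
        @Override
        public void process(Map<String, Object> doc) {
            Object bin = doc.get("pdfbin");
            if (bin instanceof byte[]) {
                // Placeholder "extraction": decode the bytes back to a string.
                doc.put("text", new String((byte[]) bin, StandardCharsets.UTF_8));
            }
        }
    }

    /** Runs every stage in order on the same document, then returns it. */
    static Map<String, Object> run(List<Stage> chain, Map<String, Object> doc) {
        for (Stage stage : chain) {
            stage.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "Example");
        doc.put("pdfurl", "http://example.com/a.pdf");
        List<Stage> chain = Arrays.asList(new FakeGetUrlStage(), new FakeExtractStage());
        System.out.println(run(chain, doc).keySet());
    }
}
```

Because each stage only looks at the fields it needs, the extract stage works the same whether `pdfbin` came from a crawler, a fetch stage, or a decoded base64 field, which is the mix-and-match property the comment argues for.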