[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2014-10-08 Thread Simon Endele (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163485#comment-14163485
 ] 

Simon Endele commented on SOLR-1763:


I'd appreciate this feature, because it would also make it possible to 
post-process the output of Tika.

 Integrate Solr Cell/Tika as an UpdateRequestProcessor
 -

 Key: SOLR-1763
 URL: https://issues.apache.org/jira/browse/SOLR-1763
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
  Labels: extracting_request_handler, solr_cell, tika, 
 update_request_handler

 From Chris Hostetter's original post in solr-dev:
 As someone with very little knowledge of Solr Cell and/or Tika, I find myself 
 wondering if ExtractingRequestHandler would make more sense as an 
 extractingUpdateProcessor -- where it could be configured to take either 
 binary fields (or string fields containing URLs) out of the Documents, parse 
 them with Tika, and add the various XPath-matching hunks of text back into 
 the document as new fields.
 Then ExtractingRequestHandler just becomes a handler that slurps up its 
 ContentStreams and adds them as binary data fields and adds the other literal 
 params as fields.
 Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths 
 in XML and CSV based updates fairly trivial?
 -Hoss
 I couldn't agree more, so I decided to add it as an issue.
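
The idea in the description can be illustrated with a small, self-contained Python sketch. It is not Solr code and does not call Tika: `fake_extract` is a hypothetical stand-in for a Tika parse, and the field names `bin`, `title` and `text` are assumptions made for the example. It only shows the shape of a processor that takes a binary field out of a document, parses it, and adds the results back as new fields.

```python
from typing import Any, Callable, Dict


def fake_extract(data: bytes) -> Dict[str, str]:
    """Hypothetical stand-in for a Tika parse: first line is the title."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    title = lines[0].strip() if lines else ""
    return {"title": title, "text": text}


def extracting_processor(
    doc: Dict[str, Any],
    source_field: str = "bin",  # assumed name of the binary field
    extract: Callable[[bytes], Dict[str, str]] = fake_extract,
) -> Dict[str, Any]:
    """Return a copy of doc with source_field replaced by extracted fields."""
    out = dict(doc)
    data = out.pop(source_field, None)
    if data is None:
        return out  # nothing to extract, pass the document through
    for name, value in extract(data).items():
        out.setdefault(name, value)  # keep fields the document already has
    return out


doc = {"id": "1", "bin": b"Annual report\nRevenue grew."}
result = extracting_processor(doc)
```

Because the processor works on document fields rather than on the request body, the same step could in principle run on documents that arrived via XML, CSV, or any other update path.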



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2013-01-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566509#comment-13566509
 ] 

Jan Høydahl commented on SOLR-1763:
---

Anyone interested in this feature?

 Integrate Solr Cell/Tika as an UpdateRequestProcessor
 -

 Key: SOLR-1763
 URL: https://issues.apache.org/jira/browse/SOLR-1763
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Jan Høydahl
  Labels: extracting_request_handler, solr_cell, tika, 
 update_request_handler




[jira] [Commented] (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2012-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402149#comment-13402149
 ] 

Jan Høydahl commented on SOLR-1763:
---

I won't have time to look at this before October-ish, so anyone should feel 
free to give it a shot :)




[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2010-09-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904979#action_12904979
 ] 

Jan Høydahl commented on SOLR-1763:
---

Ideally the UpdateProcessor will do everything that the RequestHandler does and 
more.
We might still need a RequestHandler which is capable of accepting a binary 
file as input, as well as conveying certain request parameters to the 
UpdateProcessor.
But that should probably be a new, thinner RawUpdateRequestHandler.

When this more generic architecture has proven itself superior, then we can 
start deprecating old stuff. DIH should then also start looking to the 
UpdateProcessor for its Tika needs.
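
A rough sketch of that split, in self-contained Python rather than Solr code: the handler does nothing but wrap the uploaded bytes and the `literal.*` request parameters into a document and pass it down a processor chain. The function name `raw_update_handler`, the field name `bin`, and the toy `add_size` processor are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Callable[[Doc], Doc]

LITERAL_PREFIX = "literal."


def raw_update_handler(
    body: bytes,
    params: Dict[str, str],
    chain: List[Processor],
    binary_field: str = "bin",  # assumed name for the raw upload
) -> Doc:
    """Wrap a raw upload into a document and run it through the chain."""
    doc: Doc = {binary_field: body}
    for key, value in params.items():
        # Only literal.<field> params become document fields.
        if key.startswith(LITERAL_PREFIX):
            doc[key[len(LITERAL_PREFIX):]] = value
    for processor in chain:
        doc = processor(doc)
    return doc


def add_size(doc: Doc) -> Doc:
    """Toy processor standing in for the real extraction step."""
    return {**doc, "size": len(doc["bin"])}


result = raw_update_handler(
    b"abc", {"literal.id": "doc1", "commit": "true"}, [add_size]
)
```

The handler holds no extraction logic of its own, which is the point of making it thin: everything document-related lives in the chain and can be reused by other sources.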




[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2010-08-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899548#action_12899548
 ] 

Jan Høydahl commented on SOLR-1763:
---

Starting to look into this one. Would it make the most sense to write the 
patch against contrib/extraction, since it depends on the Tika jars?




[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2010-03-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843217#action_12843217
 ] 

Jan Høydahl commented on SOLR-1763:
---

I may have a need for this functionality in an upcoming project. Can anyone 
who knows the code estimate the effort?




[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

2010-02-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831108#action_12831108
 ] 

Jan Høydahl commented on SOLR-1763:
---

Re-posting my comment from solr-dev in this ticket:
Good match. UpdateProcessors are the way to go for functionality that modifies 
documents prior to indexing.
With this, we can mix and match any type of content source with other 
processing needs.

I think it can be beneficial to have the choice to do extraction on the SolrJ 
side. But you don't always have that choice: if your source is a crawler 
without built-in Tika, some base64-encoded field in an XML document, or some 
other random source, you want to do the extraction at an arbitrary place in 
the chain.

Examples:
 Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, 
+meta...) -> index
 XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor 
(+text, +meta) -> index
 DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
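
The second example chain could look roughly like the following self-contained Python sketch. None of this is Solr or Tika code: `FAKE_WEB` stands in for real HTTP fetches, `tika_update_processor` only decodes bytes instead of parsing a PDF, and the field name `meta_size` is made up for the example. It is meant only to show how independent processors compose in sequence.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Callable[[Doc], Doc]

# Stands in for the network so the example runs offline (assumption).
FAKE_WEB = {"http://example.com/a.pdf": b"PDF bytes: hello"}


def get_url_processor(doc: Doc) -> Doc:
    """Fetch the content behind pdfurl and add it as pdfbin."""
    out = dict(doc)
    url = out.get("pdfurl")
    if url in FAKE_WEB:
        out["pdfbin"] = FAKE_WEB[url]
    return out


def tika_update_processor(doc: Doc) -> Doc:
    """Replace pdfbin with extracted text and metadata (toy extraction)."""
    out = dict(doc)
    data = out.pop("pdfbin", None)
    if data is not None:
        out["text"] = data.decode("utf-8")
        out["meta_size"] = len(data)
    return out


def run_chain(doc: Doc, processors: List[Processor]) -> Doc:
    """Apply each processor in order, feeding each the previous result."""
    return reduce(lambda d, p: p(d), processors, doc)


index: List[Doc] = []  # stands in for the index at the end of the chain
chain = [get_url_processor, tika_update_processor]
index.append(run_chain({"title": "A", "pdfurl": "http://example.com/a.pdf"}, chain))
```

Each processor only looks at the fields it cares about and passes everything else through, so the extraction step can sit at any position in the chain.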

I propose to model the document processor chain more after FAST ESP's flexible 
processing chain, which must be seen as an industry best practice. I'm thinking 
of starting a Wiki page to model what direction we should go.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

