[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-23 Thread Karl Wright (JIRA)
[ https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053874#comment-13053874 ] Karl Wright commented on CONNECTORS-214: Functionally, it should be easy to ad

[jira] [Created] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-23 Thread JIRA
Add post-extraction inclusions and exclusions into the web connector
Key: CONNECTORS-214
URL: https://issues.apache.org/jira/browse/CONNECTORS-214
Project: ManifoldCF
Issue

Re: Excluding html files and following links

2011-06-23 Thread Karl Wright
Hi Erlend, I hope you are not seeing memory issues on large files with ManifoldCF itself. That should not happen, and if it does we need to figure out why. Solr memory issues, on the other hand, I can believe. If that is the problem, then I agree we should try to do something about it. Probably

Re: Excluding html files and following links

2011-06-23 Thread Erlend Garåsen
I will create a ticket today. Post filtering sounds like a good idea. Another thing: we are facing memory problems with huge documents. Maybe we should add another feature in order to cope with such documents, for instance skipping documents which exceed a preset size. We have discovered PDFs on 5

Re: Excluding html files and following links

2011-06-23 Thread Karl Wright
Have there been any further developments on this thread?
Karl
On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright wrote:
> Sure. But you've already convinced me we need a new feature. ;-)
>
> Karl
>
> On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen wrote:
>>
>> Sure, I can create a ticket. But firs