[
https://issues.apache.org/jira/browse/SOLR-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669932#action_12669932
]
Fergus McMenemie commented on SOLR-798:
---------------------------------------
Despite the above, in my experience with solutions provided by Messrs. Verity
and Autonomy, letting the search engine walk a directory tree with millions of
documents always lets you down. It could take days to recover from some
situations. You have to manage the collection of files yourself and, while doing
so, build bulk insert/delete files (BIF files) which are passed to the search
engine to control indexing. So it is perhaps a blessing in disguise to see that
Solr won't even let me walk large directory trees.
I have a vague intention to write a DIH enhancement that reads BIF files
containing a list of add/delete instructions. If only my Java were better!
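Such a reader might look something like the sketch below. The line format (an "add" or "delete" verb followed by a path or id) and all class and method names here are hypothetical, not anything Solr or DIH defines:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical BIF parser: each non-blank, non-comment line is
// "add <path>" or "delete <id>". Neither the format nor these names
// come from Solr; they only illustrate the instruction-stream idea.
public class BifReader {
    public static class Instruction {
        public final String action;  // "add" or "delete"
        public final String target;  // file path or document id
        public Instruction(String action, String target) {
            this.action = action;
            this.target = target;
        }
    }

    public static List<Instruction> parse(Reader in) throws IOException {
        List<Instruction> out = new ArrayList<>();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            String[] parts = line.split("\\s+", 2);
            if (parts.length == 2
                    && (parts[0].equals("add") || parts[0].equals("delete"))) {
                out.add(new Instruction(parts[0], parts[1]));
            }
        }
        return out;
    }
}
```

A custom EntityProcessor could then iterate over the parsed instructions instead of walking the filesystem at all.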
However, for the record, how large a directory tree were you able to walk? I am
currently walking about 40,000 documents, but only while messing about trying
to get a feel for Solr; this strategy could not be used in production.
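For what it is worth, the Producer/Consumer FileFilter approach suggested in the issue description below could be sketched roughly as follows. The class and method names are illustrative only, not Solr's actual DIH API; the key trick is that accept() always returns false, so listFiles() never accumulates a result array, while a bounded queue blocks the directory walk until the consumer catches up:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a streaming FileFilter: each visited file is handed to a
// small bounded queue, and accept() returns false so File.listFiles()
// caches nothing. The walker thread blocks whenever the queue is full,
// i.e. until a nextRow()-style consumer drains it.
public class StreamingFileFilter implements FileFilter {
    private final BlockingQueue<File> queue = new ArrayBlockingQueue<>(16);

    @Override
    public boolean accept(File f) {
        try {
            queue.put(f);        // blocks when the consumer falls behind
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;            // listFiles() never stores anything
    }

    // Consumer side: what a nextRow()-style method would call.
    public File nextFile() throws InterruptedException {
        return queue.take();
    }
}
```

The walk itself would run on a producer thread, e.g. `new Thread(() -> dir.listFiles(filter)).start()`, with the DIH entity pulling one file at a time via nextFile().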
> FileListEntityProcessor can't handle directories containing lots of files
> -------------------------------------------------------------------------
>
> Key: SOLR-798
> URL: https://issues.apache.org/jira/browse/SOLR-798
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler
> Reporter: Grant Ingersoll
> Priority: Minor
>
> The FileListEntityProcessor currently tries to process all documents in a
> single directory at once, and stores the results into a HashMap. On
> directories containing a large number of documents, this quickly causes
> OutOfMemory errors.
> Unfortunately, the typical fix to this is to hack FileFilter to do the work
> for you and always return false from the accept method. It may be possible
> to hook up some type of Producer/Consumer multithreaded FileFilter approach
> whereby the FileFilter blocks until the nextRow() mechanism requests another
> row, thereby avoiding the need to cache everything in the map.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.