Hello, I am trying to index log files (all plain text) stored on the file system. The data can be 1000 GB or more. I am working on Windows.
A sample file can be found at https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441

I first tried FileListEntityProcessor with TikaEntityProcessor, which ended in a Java heap exception that I couldn't get rid of no matter how much I increased the RAM size.

data-config.xml:

    <dataConfig>
      <dataSource name="bin" type="FileDataSource" />
      <document>
        <entity name="f" dataSource="null" rootEntity="true"
                processor="FileListEntityProcessor"
                transformer="TemplateTransformer"
                baseDir="//mathworks/devel/bat/A/logs/66048/"
                fileName=".*\.*" onError="skip" recursive="true">
          <field column="fileAbsolutePath" name="path" />
          <field column="fileSize" name="size"/>
          <field column="fileLastModified" name="lastmodified" />
          <entity name="file" dataSource="bin"
                  processor="TikaEntityProcessor"
                  url="${f.fileAbsolutePath}" format="text"
                  onError="skip" transformer="TemplateTransformer"
                  rootEntity="true">
            <field column="text" name="text"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Then I used FileListEntityProcessor with LineEntityProcessor, which had still not finished indexing after 40 hours or so.

data-config.xml:

    <dataConfig>
      <dataSource name="bin" type="FileDataSource" />
      <document>
        <entity name="f" dataSource="null" rootEntity="true"
                processor="FileListEntityProcessor"
                transformer="TemplateTransformer"
                baseDir="//mathworks/devel/bat/A/logs/"
                fileName=".*\.*" onError="skip" recursive="true">
          <field column="fileAbsolutePath" name="path" />
          <field column="fileSize" name="size"/>
          <field column="fileLastModified" name="lastmodified" />
          <entity name="file" dataSource="bin"
                  processor="LineEntityProcessor"
                  url="${f.fileAbsolutePath}" format="text"
                  onError="skip" rootEntity="true">
            <field column="content" name="rawLine"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Is there any way I can use post.jar to index text files recursively? Or is there any other way that works without a Java heap exception and doesn't take days to index? I am completely stuck here. Any help would be greatly appreciated.
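
One untested variant of the second configuration, as a sketch only: as far as I understand the DataImportHandler documentation, setting rootEntity="false" on the outer file-listing entity makes each row of the inner entity (each line) its own Solr document, and LineEntityProcessor accepts an acceptLineRegex attribute to keep only matching lines. The regex value below is a hypothetical placeholder, and whether this actually reduces memory use or indexing time on this data is an assumption, not a measured result.

```xml
<dataConfig>
  <dataSource name="bin" type="FileDataSource" />
  <document>
    <!-- Outer entity only lists files. rootEntity="false" is the change
         from the config above (assumption: one Solr document per line). -->
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="//mathworks/devel/bat/A/logs/"
            fileName=".*\.*" onError="skip" recursive="true">
      <!-- Inner entity reads each file line by line.
           acceptLineRegex is a hypothetical filter; adjust or remove it. -->
      <entity name="file" dataSource="bin"
              processor="LineEntityProcessor"
              url="${f.fileAbsolutePath}"
              acceptLineRegex=".*(ERROR|WARN).*"
              onError="skip">
        <field column="rawLine" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that LineEntityProcessor exposes each line in a column named rawLine, so the sketch maps column="rawLine" to a Solr field named content (the field name is assumed to exist in the schema).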
Thanks,
Prerna

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
Sent from the Solr - User mailing list archive at Nabble.com.