Hello,

I am trying to index log files (all plain text) stored on the file system. The
data can be as large as 1000 GB or more. I am working on Windows.

A sample file can be found at
https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441

I first tried FileListEntityProcessor with TikaEntityProcessor. That ended in
a Java heap space error, which I couldn't get rid of no matter how much I
increased the RAM size.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="FileDataSource" />
    <document>
        <entity name="f" dataSource="null" rootEntity="true"
            processor="FileListEntityProcessor"
            transformer="TemplateTransformer"
            baseDir="//mathworks/devel/bat/A/logs/66048/"
            fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified" />

            <entity name="file" dataSource="bin"
                processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
                format="text" onError="skip" transformer="TemplateTransformer"
                rootEntity="true">
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Then I used FileListEntityProcessor with LineEntityProcessor, which had still
not finished indexing after 40 hours or so.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="FileDataSource" />
    <document>
        <entity name="f" dataSource="null" rootEntity="true"
            processor="FileListEntityProcessor"
            transformer="TemplateTransformer"
            baseDir="//mathworks/devel/bat/A/logs/"
            fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified" />

            <entity name="file" dataSource="bin"
                processor="LineEntityProcessor" url="${f.fileAbsolutePath}"
                format="text" onError="skip"
                rootEntity="true">
                <field column="content" name="rawLine"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Is there any way I can use post.jar to index text files recursively? Or is
there any other approach that works without the Java heap space error and
doesn't take days to index?
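
To make the "any other way" part concrete, here is a rough sketch of the kind
of client-side alternative I have in mind: walk the directory recursively,
read each file line by line, and send the lines to Solr in fixed-size batches
so that no whole file is ever held in memory. I have not run this against my
data. The update URL, the core name "collection1", the batch size, and the
"id" field are assumptions for illustration only; "path" and "rawLine" are
the field names from the configs above.

```python
import json
import os
import urllib.request

# Assumption: placeholder core URL, not taken from my actual setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def iter_log_files(base_dir):
    """Yield the path of every file under base_dir, recursing into subfolders."""
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def iter_batches(path, batch_size=1000):
    """Read one file line by line and yield lists of at most batch_size docs."""
    batch = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_no, line in enumerate(handle, start=1):
            batch.append({
                # Assumption: the schema has a unique key field called "id".
                "id": "%s#%d" % (path, line_no),
                "path": path,
                "rawLine": line.rstrip("\r\n"),
            })
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def post_batch(docs, url=SOLR_UPDATE_URL):
    """Send one batch to Solr as a JSON array; committing is left to Solr."""
    request = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_tree(base_dir, send=post_batch, batch_size=1000):
    """Walk base_dir and send every batch; return the number of lines sent."""
    total = 0
    for path in iter_log_files(base_dir):
        for docs in iter_batches(path, batch_size):
            send(docs)
            total += len(docs)
    return total
```

Calling index_tree("//mathworks/devel/bat/A/logs/") would be the entry point.
Passing a different send function (for example one that only counts batches)
makes it possible to time the file reading separately from the Solr side.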

I am completely stuck here. Any help would be greatly appreciated.

Thanks,
Prerna



