Jukka,

Thanks for your response.

I want to make sure that I am really running in streaming mode.  I am running 
all tests with a single thread to establish a baseline of memory usage for 
different documents; then I will move on to multiple threads, which I would 
expect to need close to n multiples of that baseline.
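As a rough way to take those baseline readings, I am using something like the
plain-JDK sketch below (the class and method names are just illustrative;
Runtime figures are approximate, since heap sizing and GC timing are
nondeterministic):

```java
// Illustrative helper for a rough heap-usage reading around a parse call.
// Runtime readings are approximate: the JVM may resize the heap and GC
// timing is nondeterministic, so treat the delta only as a rough baseline.
public class HeapBaseline {
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // encourage a collection so the reading is less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        byte[] work = new byte[4 * 1024 * 1024]; // stand-in for a parse's allocations
        long after = usedHeapBytes();
        System.out.println("approx. delta bytes: " + (after - before));
        if (work.length == 0) throw new IllegalStateException(); // keep 'work' live
    }
}
```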

Can you tell me whether streaming mode involves more than just passing an 
InputStream to Tika?

I am using Tika as demonstrated below, creating an AutoDetectParser and passing 
it an InputStream that will in practice be an instance of either a 
ByteArrayInputStream or a FileInputStream:

        import java.io.InputStream;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.Parser;
        import org.apache.tika.sax.BodyContentHandler;
        import org.xml.sax.ContentHandler;

        private Parser m_parser = new AutoDetectParser();

        public String getText(InputStream is) throws DocumentHandlerException
        {
            Metadata metadata = new Metadata();
            ContentHandler handler = new BodyContentHandler();
            try
            {
                m_parser.parse(is, handler, metadata);
                return handler.toString();
            }
            catch (Exception e)
            {
                throw new DocumentHandlerException(
                        "Cannot extract text from the document", e);
            }
        }
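One thing I noticed while measuring: a ByteArrayInputStream already holds the
entire document in memory before the parser ever sees it, so a FileInputStream
should give a truer streaming baseline. Setting the Tika classes aside, the
difference can be sketched with plain java.io (the file and sizes here are
made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamVsBuffer {
    // Reads a stream through a fixed 8 KB buffer: memory use is bounded
    // by the buffer size, not by the size of the underlying document.
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("doc", ".bin");
        Files.write(tmp, new byte[1024 * 1024]); // 1 MB stand-in document

        // Streaming: only the 8 KB buffer is resident while reading.
        try (InputStream in = new FileInputStream(tmp.toFile())) {
            System.out.println("streamed bytes: " + drain(in)); // 1048576
        }

        // Buffered: the full 1 MB is in memory before reading even starts.
        byte[] all = Files.readAllBytes(tmp);
        try (InputStream in = new ByteArrayInputStream(all)) {
            System.out.println("buffered bytes: " + drain(in)); // 1048576
        }
        Files.delete(tmp);
    }
}
```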

Thanks in advance,

David


-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com] 
Sent: Thursday, January 21, 2010 5:47 PM
To: tika-user@lucene.apache.org
Subject: Re: Memory Usage/needs for file sizes/types

Hi,

Sorry for the late response...

On Tue, Jan 5, 2010 at 5:47 PM, Baldwin, David <david_bald...@bmc.com> wrote:
> I need to get a handle on how much memory Tika needs to token-ize different
> file types.  In other words, I need to find information on required overhead
> (including copies of buffers made if applicable) so that I can produce some
> kind of guidelines for memory possibly needed by users of the product I am
> working on which uses Lucene/Tika.

If you use Tika in streaming mode, memory use is moderate and typically
does not depend on the size of the document being processed. Parsing
complex formats like MS Office or PDF files can require up to a few
megabytes of memory, while simple formats like plain text only need a few
kilobytes.
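For reference, the streaming model Tika builds on is SAX: the parser pushes
character events to a ContentHandler as it reads, so nothing forces the whole
document into memory unless the handler itself buffers the text (as calling
toString() on a BodyContentHandler does). A JDK-only sketch of the same
pattern, forwarding events straight to a Writer:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxStreamDemo {
    // Handler that forwards character events straight to a Writer, so
    // memory use does not grow with the length of the document text.
    static class WriteOutHandler extends DefaultHandler {
        private final Writer out;
        WriteOutHandler(Writer out) { this.out = out; }
        @Override
        public void characters(char[] ch, int start, int length) {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>Hello</p><p> world</p></doc>";
        StringWriter sink = new StringWriter(); // could be a FileWriter instead
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new WriteOutHandler(sink));
        System.out.println(sink); // prints "Hello world"
    }
}
```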

In addition to the above estimates, you also need to account for the
memory the JVM needs to load the Tika classes and the relevant parser
libraries. In total I'd estimate that you should get pretty far with some
20 megabytes of memory for Tika, unless you have dozens of parsing tasks
running concurrently.

BR,

Jukka Zitting
