Hi Tim
On 14/07/14 14:53, Allison, Timothy B. wrote:
Hi Sergey,
Now, we already have the original PDF occupying some space, so
duplicating it (its content) with a Document with Store.YES fields may
not be the best idea in some cases.
In some cases, agreed, but in general, this is probably a good default idea.
As you point out, you aren't quite duplicating the document -- one copy contains
the original bytes, and the other contains the text (and metadata?) that was
extracted from the document. One reason to store the content in the field is
for easy highlighting. You could configure the highlighter to pull the text
content of the document from a db or other source, but that adds complexity and
perhaps lookup time. What you really would not want to do from a time
perspective is ask Tika to parse the raw bytes to pull the content for
highlighting at search time. In general, Lucene's storage of the content is
very reasonable; on one big batch of text files I have, the Lucene index with
stored fields is the same size as the uncompressed text files.
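To make the highlighting point concrete, here's a minimal sketch of the
stored-field approach. The field names, the RAMDirectory and the sample query
are only placeholders, and the exact constructor signatures (StandardAnalyzer,
IndexWriterConfig, QueryParser) vary a bit between Lucene versions:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class StoredFieldHighlightSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory();

        // Index the text Tika extracted, stored so the highlighter can use it later.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "text extracted by Tika ...", Field.Store.YES));
            writer.addDocument(doc);
        }

        // At search time the snippet comes straight from the stored field.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("tika");
            Highlighter highlighter =
                    new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                String stored = searcher.doc(hit.doc).get("content");
                System.out.println(highlighter.getBestFragment(analyzer, "content", stored));
            }
        }
    }
}

The snippet is built from the text pulled back out of the index, so there is no
need to re-run Tika against the original bytes at search time.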
OK. I'm sure Lucene is very good at what it does. I'm just trying to
figure out what the limits may be. By the way, I apologize if this is off
topic. For now, though, it seems to me that Tika and Lucene can make a
perfect combination, something Solr and other implementations build upon,
AFAIK...
So I wonder, is it possible somehow for a given Tika Parser, let's say a
PDF parser, to report, via the Metadata, the start and end indexes of the
content? Could the consumer then create, say, an InputStreamReader for that
content region and use Store.NO with this Reader?
I don't think I quite understand what you're proposing. The start and end
indexes of the extracted content? Wouldn't that just be 0 and the length of
the string in most cases (beyond-BMP issues aside)? Or, are you suggesting
that there may be start and end indexes for content within the actual raw bytes
of the PDF? If the latter, for PDFs at least that would effectively require a
full reparse ... if it were possible, and it probably wouldn't save much in
time. For other formats, where that might work, it would create far more
complexity than value...IMHO.
Start and end indexes for content within the actual raw bytes... It's
theoretical for me at this point in time. I was thinking of this case:
we have a PDF stored on disk, Tika parsing the content against a NOP
content handler and providing these indexes, and we have a Document in
memory only, using a Reader to populate the content field.
When we restart, we pay the penalty (the only penalty) of having Lucene
repopulate the Document from the Reader; on the plus side, the content is
only ever stored on disk or in a DB as part of the original PDF image.
We'd only have to persist these indexes to avoid having Tika reparse the file.
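Just to illustrate the idea (and only the idea: Tika does not report such
offsets today, so the start/end values below are purely hypothetical, and
commons-io is assumed for skipping and bounding the stream), the Lucene side
would be a Reader-backed, unstored field:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.input.BoundedInputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;

public class ReaderBackedFieldSketch {

    // start/end are the hypothetical offsets of the text region inside the
    // original file; something else would have to persist them across restarts.
    static Document rebuildDocument(Path original, long start, long end) throws Exception {
        InputStream in = Files.newInputStream(original);
        IOUtils.skipFully(in, start);                    // jump to the start of the region
        Reader reader = new BufferedReader(new InputStreamReader(
                new BoundedInputStream(in, end - start), // stop at the end of the region
                StandardCharsets.UTF_8));
        Document doc = new Document();
        // A Reader-based TextField is tokenized and indexed but never stored,
        // so the index holds no second copy of the content.
        doc.add(new TextField("content", reader));
        return doc;                        // IndexWriter.addDocument() consumes the Reader
    }
}

Whether such a contiguous text region can even be defined inside a PDF is, of
course, exactly your point above, so this is just a thought experiment.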
I think I won't worry about over-optimizing it all at this stage;
from what I understand, working with Lucene means no major storage limits
exist :-)
In general, I'd say store the field. Perhaps let the user choose to not store
the field.
I guess in the context of working with Tika I'd go for storing the fields
for now...
Thanks, Sergey
Always interested to hear input from others.
Best,
Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Friday, July 11, 2014 1:38 PM
To: [email protected]
Subject: Re: How to index the parsed content effectively
Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
Hi Sergey,
I'd take a look at what the DataImportHandler in Solr does. If you want to
store the field, you need to create the field with a String (as opposed to a
Reader), which means you have to have the whole thing in memory. Also, if
you're proposing adding a field entry in a multivalued field for a given SAX
event, I don't think that will help, because you still have to hold the entire
document in memory before calling addDocument() if you are storing the field.
If you aren't storing the field, then you could try a Reader.
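As a rough sketch of that contrast (using Tika's AutoDetectParser,
BodyContentHandler and ParsingReader; the field names are just placeholders):

import java.io.InputStream;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ParsingReader;
import org.apache.tika.sax.BodyContentHandler;

public class StoredVsStreamedSketch {

    // Store.YES: the whole extracted text has to exist as one String in memory.
    static Document storedDocument(InputStream stream) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        Document doc = new Document();
        doc.add(new TextField("content", handler.toString(), Field.Store.YES));
        return doc;
    }

    // Store.NO: a Reader-backed field is tokenized as IndexWriter consumes it,
    // so the full text never has to be held as a single String.
    static Document unstoredDocument(InputStream stream) throws Exception {
        Reader reader = new ParsingReader(
                new AutoDetectParser(), stream, new Metadata(), new ParseContext());
        Document doc = new Document();
        doc.add(new TextField("content", reader));
        return doc;
    }
}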
I'd like to ask something about using a Tika parser and a Reader (and
Lucene Store.NO)
Consider a case where we have a service which accepts a very large PDF
file. This file will be stored on disk or maybe in some DB. And
this service will also use Tika to extract content and populate a Lucene
Document.
Now, we already have the original PDF occupying some space, so
duplicating it (its content) with a Document with Store.YES fields may
not be the best idea in some cases.
So I wonder, is it possible somehow for a given Tika Parser, let's say a
PDF parser, to report, via the Metadata, the start and end indexes of the
content? Could the consumer then create, say, an InputStreamReader for that
content region and use Store.NO with this Reader?
Does it really make sense at all? Should I create a minor enhancement
request for parsers to get access to low-level info like the start/stop
delimiters of the content and report it?
Cheers, Sergey
Some thoughts:
At the least, you could create a separate Lucene document for each
container document and each of its embedded documents (there's a rough
sketch of this below).
You could also break large documents into logical sections and index those
as separate documents, but that gets very use-case dependent.
In practice, for many, many use cases I've come across, you can index quite large documents
with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be
a hit at highlighting time for large docs depending on which highlighter you use. In the old days,
there used to be a 10k default limit on the number of tokens, but that is now long gone.
For truly large docs (probably machine generated), yes, you could run into
problems if you need to hold the whole thing in memory.
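Here is the promised sketch of the first suggestion (one Lucene document per
container document and per embedded document), using Tika's
RecursiveParserWrapper. This is roughly the Tika 1.x shape of that API, and
the field names are placeholders:

import java.io.InputStream;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class PerEmbeddedDocIndexer {

    static void indexContainer(InputStream container, String sourceId, IndexWriter writer)
            throws Exception {
        BasicContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
        RecursiveParserWrapper wrapper =
                new RecursiveParserWrapper(new AutoDetectParser(), factory);
        wrapper.parse(container, new DefaultHandler(), new Metadata(), new ParseContext());

        // One Metadata per document: the first is the container, the rest are attachments.
        List<Metadata> parsed = wrapper.getMetadata();
        for (Metadata m : parsed) {
            String text = m.get(RecursiveParserWrapper.TIKA_CONTENT);
            Document doc = new Document();
            doc.add(new StringField("source", sourceId, Field.Store.YES));
            doc.add(new TextField("content", text == null ? "" : text, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}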
Cheers,
Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Wednesday, July 02, 2014 8:27 AM
To: [email protected]
Subject: How to index the parsed content effectively
Hi All,
We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.
This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.
Perhaps a custom ContentHandler could add content fragments to a Lucene
field every time its characters(...) method is called, something
I've been planning to experiment with.
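A bare-bones sketch of that experiment might be a handler along these lines
(the class and field names are made up, and nothing here is optimized):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

public class FragmentIndexingHandler extends DefaultHandler {

    private final Document doc = new Document();

    @Override
    public void characters(char[] ch, int start, int length) {
        // One more value of the multivalued "content" field per SAX text event.
        doc.add(new TextField("content", new String(ch, start, length), Field.Store.NO));
    }

    public Document getDocument() {
        return doc;
    }
}

The Document would still be handed to IndexWriter.addDocument() in one go after
parsing, and phrase queries spanning two characters(...) fragments would be
affected by the analyzer's position increment gap between field values.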
Any feedback will be appreciated.
Cheers, Sergey