Hi,

On 02/07/14 14:05, Christian Reuschling wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

If you want to give it a try, we created a crawling Tika parser, which adds
recursive, incremental crawling capabilities to Tika. There we also
implemented a handler as a decorator that writes into a Lucene index.

Check out 'Create a Lucene index' here:

https://github.com/leechcrawler/leech/blob/master/codeSnippets.md

The code itself may also serve as a starting point.
Thanks for the link. Our requirements are fairly simple: we want to provide utility code that lets our users index the content passing through a ContentHandler effectively enough. We will check the code and see if something similar can be applied to our case, and will get back to confirm if so...

Thanks, Sergey




best

Chris

On 02.07.2014 14:27, Sergey Beryozkin wrote:
Hi All,

We've been experimenting with indexing the parsed content in Lucene, and our
initial attempt was to index the output from ToTextContentHandler.toString()
as a Lucene Text field.

This is unlikely to be effective for large files, so I wonder what strategies
exist for more effective indexing/tokenization of possibly large content.

Perhaps a custom ContentHandler could index content fragments in a unique
Lucene field every time its characters(...) method is called, something I've
been planning to experiment with.
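
A minimal sketch of that idea, using only the JDK's SAX classes: the handler buffers characters(...) events and passes the text on in bounded fragments, cutting at whitespace where possible so that words are not split. The class name FragmentingContentHandler, the maxChars limit and the sink callback are all hypothetical; the sink only stands in for whatever would add a field to a Lucene document, and no Lucene or Tika API is used here.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects SAX character events and forwards them
 * in fragments of at most maxChars characters, so the complete text is
 * never held in memory as a single String.
 */
class FragmentingContentHandler extends DefaultHandler {
    private final int maxChars;
    // Assumption: a real sink would add each fragment as a field value.
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    FragmentingContentHandler(int maxChars, Consumer<String> sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit fragments while the buffer has reached the limit.
        while (buffer.length() >= maxChars) {
            int cut = lastWhitespaceBefore(maxChars);
            if (cut <= 0) {
                cut = maxChars; // no usable whitespace: hard cut
            }
            sink.accept(buffer.substring(0, cut));
            buffer.delete(0, cut);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing has finished.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    /** Index of the last whitespace in [1, limit), or -1 if there is none. */
    private int lastWhitespaceBefore(int limit) {
        for (int i = limit - 1; i >= 1; i--) {
            if (Character.isWhitespace(buffer.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Concatenating the fragments in order gives back the original text, so the sink is free to index each one as it arrives. Whether many small field values actually index better than one large value is something that would need measuring.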

Any feedback will be appreciated.

Cheers, Sergey

- --
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:[email protected]  http://www.dfki.uni-kl.de/~reuschling/

- ------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                   Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlO0A5UACgkQ6EqMXq+WZg/oLgCgkdpH5uRoYncVhLadg7qxjXKD
PZQAn1jxxRejVGchXXoYA08BIA3ldOKH
=ulNT
-----END PGP SIGNATURE-----

