Hi,
On 02/07/14 14:05, Christian Reuschling wrote:
If you want to give it a try, we created a crawling Tika parser that adds
recursive, incremental crawling capabilities to Tika. We also implemented a
handler, as a decorator, that writes into a Lucene index.
See 'Create a Lucene index' here:
https://github.com/leechcrawler/leech/blob/master/codeSnippets.md
The code itself may also serve as a starting point.
Thanks for the link. Our requirements are fairly simple: we want to
provide utility code that lets our users index the content passing through
a ContentHandler effectively enough. We will check the code and see whether
something similar can be applied to our case, and will get back with
confirmation if so.
Thanks, Sergey
best
Chris
On 02.07.2014 14:27, Sergey Beryozkin wrote:
Hi All,
We've been experimenting with indexing the parsed content in Lucene. Our
initial attempt was to index the output of ToTextContentHandler.toString()
as a Lucene Text field.
This is unlikely to be effective for large files, since the whole content
has to be held in memory as one string. So I wonder what strategies exist
for more effective indexing/tokenization of possibly large content.
Perhaps a custom ContentHandler could index content fragments in a unique
Lucene field every time its characters(...) method is called; that is
something I've been planning to experiment with.
Any feedback will be appreciated.
Cheers, Sergey
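
A minimal sketch of the fragment-based idea above, using only the JDK's SAX
classes. It is not taken from Tika, Lucene, or leech: the class name
ChunkingContentHandler, the maxChunkChars parameter, and the Consumer<String>
sink are all hypothetical. The sink stands in for whatever would add a
fragment to the index (for example, one field value per chunk). Fragments are
buffered up to a size limit rather than indexed per characters(...) call,
because SAX may deliver text in arbitrarily small pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Buffers SAX character events and hands them to a sink in bounded chunks,
 * so the full document text never has to exist as one String.
 */
class ChunkingContentHandler extends DefaultHandler {
    private final int maxChunkChars;
    // Hypothetical stand-in for "add this fragment to the Lucene index".
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int maxChunkChars, Consumer<String> sink) {
        if (maxChunkChars <= 0) {
            throw new IllegalArgumentException("maxChunkChars must be positive");
        }
        this.maxChunkChars = maxChunkChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit chunks while the buffer holds at least maxChunkChars characters.
        while (buffer.length() >= maxChunkChars) {
            int cut = findCut();
            sink.accept(buffer.substring(0, cut));
            buffer.delete(0, cut);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing has finished.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    /**
     * Prefers to cut just after the last whitespace within the first
     * maxChunkChars characters, so words are not split across chunks.
     * Falls back to a hard cut (which may split a word) if there is none.
     */
    private int findCut() {
        for (int i = maxChunkChars - 1; i >= 0; i--) {
            if (Character.isWhitespace(buffer.charAt(i))) {
                return i + 1;
            }
        }
        return maxChunkChars;
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        ChunkingContentHandler handler = new ChunkingContentHandler(10, chunks::add);
        char[] text = "hello world foo".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endDocument();
        chunks.forEach(c -> System.out.println("[" + c + "]"));
    }
}
```

In a real setup the handler would be passed to the parser in place of
ToTextContentHandler, and the sink would add each chunk to the index. Whether
chunks should become values of one multi-valued field or separate documents
is left open here.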
--
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer
Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
Phone: +49.631.20575-1250
http://www.dfki.uni-kl.de/~reuschling/