Hi David,

The latest Nutch release candidate (1.1, 
http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1) includes the tika-parser 
plugin, which provides a JpegParser (see here: http://bit.ly/b0zRX8) that 
will hopefully suit your needs.

Let me know what you think.

Cheers,
Chris


On 4/10/10 6:56 AM, "Gombkötő Dávid" <madav...@gmail.com> wrote:

Hello.

I'm working on a school task, which is to modify Nutch so that it can
identify and download JPEGs, create a thumbnail, and index the URLs of
these JPEGs with the other crawl results so that the web interface can
show images as well.

At first I found that ParserNotFound.java could do the trick for me.
I modified the constructor so that it matches the end of the URL against a
pattern, and if the URL ends in jpeg it creates a file named after the
MD5 sum of the URL, writes the URL into it, and stores it in a directory
on my filesystem. Well, this is ugly. I wanted to add the working directory
to ParserNotFound.java, but I couldn't. To move forward with my
work, I first need to know how to write my own JPEG parser. After
that I would like to index my results somehow :)
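
To make the workaround above concrete, here is a minimal standalone sketch
of the URL-matching and MD5-naming step. It is not Nutch code and does not
touch ParserNotFound.java; the class name JpegUrlRecorder, its method names,
and the output directory are all hypothetical, and matching both ".jpg" and
".jpeg" is an assumption.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: records JPEG URLs as small files.
class JpegUrlRecorder {
    // Assumption: a URL counts as a JPEG if it ends in .jpg or .jpeg (any case).
    private static final Pattern JPEG_URL =
            Pattern.compile(".*\\.jpe?g$", Pattern.CASE_INSENSITIVE);

    static boolean isJpegUrl(String url) {
        return JPEG_URL.matcher(url).matches();
    }

    // Returns the MD5 digest of the string as 32 lowercase hex characters.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Writes the URL into dir/<md5 of url>; returns null for non-JPEG URLs.
    static Path record(String url, Path dir) throws IOException {
        if (!isJpegUrl(url)) {
            return null;
        }
        Files.createDirectories(dir);
        Path out = dir.resolve(md5Hex(url));
        Files.write(out, url.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Uses a temporary directory so the demo does not depend on a fixed path.
        Path dir = Files.createTempDirectory("jpeg-urls");
        System.out.println(record("http://example.com/photo.jpeg", dir));
    }
}
```

Passing the directory in as a parameter, rather than hard-coding it, is one
way around the working-directory problem mentioned above.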

So, my question: how can I add my JPEG parser? Or, how can I add a new
parser to the Nutch system? Thanks for your answers.




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
