Yes, I had read that tutorial and used it to get to where I am. It's very
confusing, alas. It's unfortunate that this is the clearest description that
I could find, because it (like others that I have found) assumes a large amount
of prior knowledge. There doesn't seem to be a clear,
I have some PDF files; the data in them is scanned articles and some
Unicode text. I am using Tika as the PDF parser, but the parser fails for PDFs
with images in them. Is it possible to index only the parsable data present in
such a PDF? Currently it is not indexing any data from that PDF.
Thanks.
On Mon, 23 Apr 2012 10:23:05 +0100, Ian Piper ianpi...@tellura.co.uk
wrote:
Hi all,
I have set up a process for crawling a client's website using Nutch
and then creating a Solr index. I have run into a workflow problem and
would appreciate some guidance - preferably a tutorial of some sort.
Hi Ian,
There is an older, well-used script [0], which we moved to the archive
section of the wiki to clean it up a bit.
If you browse around there you should be able to manufacture something
custom for your needs with little overhead. You may also wish to scrutinise
the current command line
Can you please post the output when the Tika parser plugin fails?
If possible also the URL if you can...?
Thank you
Lewis
On Tue, Apr 24, 2012 at 6:38 AM, nutchsolruser nutchsolru...@gmail.com wrote:
I have some pdf files , data present in pdf is scanned articles and some
unicode text. I am
It could also be due to the file size.
//Remi
On Tuesday, April 24, 2012, nutchsolruser nutchsolru...@gmail.com wrote:
I have some pdf files , data present in pdf is scanned articles and some
unicode text. I am using tika as pdf parser. but parser fails for pdf's with
images in it. is it
Hi Lewis, thanks for the reply. Sorry I couldn't get back to you sooner, as I was
on vacation.
I tried out the NUTCH-1044 patch on Nutch 1.4 with a test website where a JSP
page sends a 302 redirect to another JSP page. But here I observed that
the score of the redirected URL is still set
I have just started picking up Nutch for a project that I am working on that
involves building a framework for QA teams to check site pages for broken
links, expired content, etc. I have successfully set up Nutch on a Windows
OS and fired up an initial crawl with a single URL in the seed file. I