Hi,
I want to use Nutch for crawling content and Lucene to extract and analyze
the contents of the index created by Nutch. I'm trying to extract from the
index the contents of web pages, but I don't know how to set the
NutchDocumentAnalyzer in my application. If I use the StandardAnalyzer of
Hello,
The problem is partially solved, but I am writing about it anyway :)
Using the individual bin/nutch commands (inject, generate, fetch, etc.) works.
Only bin/nutch crawl is not
--
I have successfully set up a Hadoop cluster on 6 nodes (1
inghe wrote:
Hi,
I want to use Nutch for crawling content and Lucene to extract and analyze
the contents of the index created by Nutch. I'm trying to extract from the
index the contents of web pages, but I don't know how to set the
NutchDocumentAnalyzer in my application. If I use the
Alternatively, you can modify Nutch to store the content in the index.
Andrzej, is that a bad idea? What do you think?
Best Regards
Alexander Aristov
2009/5/14 Andrzej Bialecki a...@getopt.org
inghe wrote:
Hi,
I want to use Nutch for crawling content and Lucene to extract and
analyze the
Hi all,
I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below has been said
by others before; here I'm trying to express how it looks from my point
of view.
Target audience
===
I think that the Nutch
Hi, I am new to Nutch. I need to read a directory and then index the new
files present there. Is this possible with Nutch?
I apologize if someone has already posted this message, but I was not
able to understand it.
Can anyone guide me in this area? I'd really appreciate your help on this.
Fetcher2 from 0.9 was renamed to Fetcher in 1.0. In both versions it runs
more slowly for me than the original fetcher. There's no solution yet that
I'm aware of.
Cheers...
Roger
--
From: askNutch hehehah...@126.com
Sent: Wednesday, May 06, 2009
Thank you for the answer, but I still have a doubt!
Why can I read the content field in Luke if I load the index created
by Nutch?
When I load the index created by Nutch-1.0 in Luke, I can view the
fields url, title, host, etc., but not all fields; if I click on an Edit
button, it opens a
Thanks for this information about recrawling.
I am running a recrawl, but every time I do it I don't get the same
results as the first crawl (different documents, not the same web
pages). So how can I recrawl the same pages?
Maybe set the property db.default.fetch.interval
Thanks, my own heuristic is quite clear. However, to implement it you
need to be able to 'read' the document content and analyze it. I'm (was?) under
the impression that in the scoring plugin you can NOT access the document
content.
Am I wrong?
Also, I don't fully understand why there is a method
inghe wrote:
Thank you for the answer, but I still have a doubt!
Why can I read the content field in Luke if I load the index created
by Nutch?
When I load the index created by Nutch-1.0 in Luke, I can view the
fields url, title, host, etc., but not all fields; if I click on an Edit
button
Andrzej, great summary. I played with Nutch before for a web search engine,
but have not used it for a while because it has become too complicated. Based
on my experience building a semantic search engine for the healthcare vertical,
I think it would be beneficial to separate crawling from search
Hi Andrzej,
Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:
* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where
I'm a beginner with Nutch, and I have just learned how to add a plugin to my Nutch.
But I'm still confused about how the plugins work.
If I want to add a plugin that helps grab all the
pictures like ' ooxx ', how should I do that?
Please help!! And thanks so much!!
Hi:
In the scoring plugin, you can get the document content. There is one interface
you can implement:
ScoringFilter (http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html).
You can also just extend OPICScoringFilter. This interface has two
important methods:
*void