Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2009-05-14 Thread inghe
Hi, I want to use Nutch for crawling content and Lucene for extracting and analyzing the contents of the index created by Nutch. I'm trying to extract the contents of web pages from the index, but I don't know how to set the NutchDocumentAnalyzer in my application. If I use the StandardAnalyzer of

Job not finished on nutch and hadoop

2009-05-14 Thread Bartosz Gadzimski
Hello, The problem is partially solved, but I'm still writing about it :) Using the individual bin/nutch commands (inject, generate, fetch, etc.) works. Only bin/nutch crawl does not. -- I have successfully set up a Hadoop cluster on 6 nodes (1

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2009-05-14 Thread Andrzej Bialecki
inghe wrote: Hi, I want to use Nutch for crawling content and Lucene for extracting and analyzing the contents of the index created by Nutch. I'm trying to extract the contents of web pages from the index, but I don't know how to set the NutchDocumentAnalyzer in my application. If I use the

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2009-05-14 Thread Alexander Aristov
Or, as an option, you can modify Nutch to store content in the index. Andrzej, is that a bad idea? What do you think? Best Regards Alexander Aristov 2009/5/14 Andrzej Bialecki a...@getopt.org inghe wrote: Hi, I want to use Nutch for crawling content and Lucene for extracting and analyzing the

The Future of Nutch, reactivated

2009-05-14 Thread Andrzej Bialecki
Hi all, I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying here to express it as it looks from my point of view. Target audience === I think that the Nutch

crawling and indexing in a directory

2009-05-14 Thread sandeep bonkra
Hi, I am new to Nutch. I need to read a directory and then index the new files present there. Is this possible with Nutch? I apologize if someone has already posted this message, but I was not able to understand it. Can anyone guide me in this area? I really appreciate your help on this.

Re: Fetcher2 Slow

2009-05-14 Thread Roger Dunk
Fetcher2 from 0.9 was renamed to Fetcher in 1.0. In both versions it runs more slowly for me than the original fetcher. There's no solution yet that I'm aware of. Cheers... Roger -- From: askNutch hehehah...@126.com Sent: Wednesday, May 06, 2009

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2009-05-14 Thread inghe
Thank you for the answer, but I still have a doubt! Why can I read the content field in Luke if I load the index file created by Nutch? So, I load the index file created by Nutch-1.0 in Luke, and then I can view the fields url, title, host, etc., but not all fields; if I click on an Edit button, it opens a

Re: Recrawl urls

2009-05-14 Thread aidahaj
Thanks for this information about recrawling. I am running a recrawl operation, but every time I do it, I don't get the same results as the first crawl (different documents, not the same web pages). So how can I recrawl the same pages? Maybe fix the property db.default.fetch.interval

Re: Topical/focus URL scoring

2009-05-14 Thread Raymond Balmès
Thx, my own heuristic is quite clear to me... however, to implement it you need to be able to 'read' the document content and analyze it. I'm (was?) under the impression that in the scoring plugin you can NOT access the document content. Am I wrong? Also, I don't fully understand why there is method

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

2009-05-14 Thread Andrzej Bialecki
inghe wrote: Thank you for the answer, but I still have a doubt! Why can I read the content field in Luke if I load the index file created by Nutch? So, I load the index file created by Nutch-1.0 in Luke, and then I can view the fields url, title, host, etc., but not all fields; if I click on an Edit button

Re: The Future of Nutch, reactivated

2009-05-14 Thread AJ Chen
Andrzej, great summary. I played with Nutch before for a web search engine, but have not used it for a while because it has become too complicated. Based on my experience in building a semantic search engine for the healthcare vertical, I think it would be beneficial to separate crawling from search

Re: The Future of Nutch, reactivated

2009-05-14 Thread Mattmann, Chris A
Hi Andrzej, Great summary. My general feeling on this is similar to my prior comments on similar threads from Otis and from Dennis. My personal pet projects for Nutch2: * refactored Nutch core data structures, modeled as POJOs * refactored Nutch architecture where

How to snatch Pictures by Nutch!

2009-05-14 Thread infinityhp
I'm a beginner with Nutch, and just learned how to add a plugin to my Nutch. But I'm still confused by how the plugins work. And I wondered, if I want to add a plugin which can help snatch all the pictures like ' ooxx ', how should I do that? Please help!! And thanks so much!! -- View this

Re: Topical/focus URL scoring

2009-05-14 Thread yanky young
Hi: In the scoring plugin, you can get document content. There is one interface you can implement: ScoringFilter (http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html). Also, you can just extend OPICScoringFilter. This interface has two important methods: *void