Hi,

I'm designing an application that needs to extract and analyse content and metadata for a selected set of web pages. The page URLs are generated by an existing application component based on information collected in a database. After the initial page analysis, the application should be able to detect changes in the pages and redo the analysis as needed. The number of pages is expected to grow gradually up to about 20 million.
The application considers certain subsets of URLs to be related, and these sets should be analysed and processed together once the fetched content is available. Typically a subset will contain 10 to 20 URLs, often located on the same website. For each URL subset the application will decide which links to follow and fetch. The application also needs to extract information from the page content; it could do this in the same pass as Nutch's own parsing or in an additional parsing step.

Being new to Nutch, I'd like to ask this mailing list what the best way would be to use and customise Nutch in this kind of scenario. Can you point me to extension point documentation and other resources that would be relevant here? So far, I've identified the following extension points that could potentially be useful (see the sketches in the P.S. for what I have in mind):

* URLFilter, SegmentMergeFilter or ScoringFilter: selecting new links to fetch
* indexing filter: analysing and processing page content
* index writer: post-processing a segment and writing out the results

Do you think these are valid choices in this case? Which Nutch version would you recommend for new development projects, 1.9 or 2.3?

Can the fetch list be changed dynamically during a crawl, or can new URLs only be added for the next crawl?

How should the URL subset concept relate to the segment concept in Nutch? From the application's point of view it would be simpler to have Nutch process each subset of URLs as one segment, but this design would probably trade off performance if throttling is used. Does Nutch typically process just one segment at a time, or can multiple segments be processed concurrently with throttling applied across all segments? Or should the URL subset concept be implemented in the application instead, with the application tracking when the content for a particular subset becomes available?

best regards,
marko
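
P.S. To make the first bullet more concrete, here is roughly what I imagine a custom URL filter plugin would look like, based on my reading of the 1.x URLFilter interface. The class name and the subset lookup are placeholders of mine, not existing Nutch code; please correct me if I've misread the contract:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Sketch only: accept a URL when it belongs to one of the
// application's URL subsets, reject it otherwise.
public class SubsetUrlFilter implements URLFilter {

  private Configuration conf;

  // URLFilter contract: return the URL (possibly modified) to
  // accept it, or null to reject it.
  public String filter(String urlString) {
    return isInKnownSubset(urlString) ? urlString : null;
  }

  // Placeholder: the real plugin would consult the application
  // database that defines the URL subsets.
  private boolean isInKnownSubset(String url) {
    return url != null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

I understand I'd also need a plugin.xml descriptor to register the extension with the plugin system; I've left that out here.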
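
P.P.S. Similarly, for the content analysis bullet, I'm thinking of an indexing filter along these lines (again only a sketch against my reading of the 1.x IndexingFilter interface; the field name "subsetAnalysis" and the analyse() helper are my own inventions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch only: run the application's analysis over the parsed text
// and store the result in a custom document field, so a custom
// index writer could pick it up for post-processing.
public class SubsetIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Returning null instead would drop the document entirely.
    doc.add("subsetAnalysis", analyse(parse.getText()));
    return doc;
  }

  // Placeholder for the application's content analysis step.
  private String analyse(String text) {
    return text;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

Does this look like a reasonable division of labour between an indexing filter and an index writer, or would a parse filter be the more natural place for the analysis?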

