Hi, CC: [email protected]
Questions like this should really go to the user@ list; you have a much better chance of being helped there, as there are many, many eyes.

On Wed, Apr 24, 2013 at 8:57 AM, <[email protected]> wrote:
>
> I would be really grateful if you could provide some links on the
> following topics.
> 1. breaking down the nutch commands into steps (at the mo i just use one
> line)
>

http://wiki.apache.org/nutch/CommandLineOptions
http://wiki.apache.org/nutch/NutchTutorial

> 2. settings such as normalization or full site indexing *.com/ or *.html
>

Please see
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.html
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.html
http://nutch.apache.org/apidocs-1.6/index.html?org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.html

Optional normalization and filtering are also permitted on many of the individual tasks described in the links under 1 above.

> 3. resetting the crawldb (i use both mysql and solr on different machines)
> - changing the crawl output directory seems to work without sql
>

Can you be more verbose here, please? I really don't understand. Do you mean resetting the fetch time for particular URLs? If you mean completely resetting the crawldb, then you would be as well dumping the entire URL list and then injecting the URLs into a fresh crawldb.

> 4.
> http://lucene.472066.n3.nabble.com/parse-data-directory-not-found-after-merge-td3635615.html
> I have this problem after starting a full crawl without topN settings and
> then starting a solr crawl to the same crawldb dir, this has started the
> nutch bot to search to *.html level without normalisation however it does
> not complete to solr where I would like to extract text data from the html
> pages
>
> It seems that this was never ever addressed and the thread dies out!!!
Can anyone please comment on whether they are losing the parse_data directory when merging segments? If this can be reproduced, then we need to address it. I don't have time to look at it today, even on a test crawl, sorry.

Lewis
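
For question 1, here is a rough sketch of how the one-line crawl breaks down into individual steps, assuming a Nutch 1.x install with the launcher at bin/nutch. The directory names (crawl, urls), the segment timestamp and the topN value are made up for illustration, and the helper crawl_steps is hypothetical, not part of Nutch. The script only builds and prints the commands; it does not run them. Check each one against the CommandLineOptions wiki page before use.

```python
from typing import List

NUTCH = "bin/nutch"  # assumed path to the Nutch 1.x launcher script


def crawl_steps(crawl_dir: str, seed_dir: str, segment: str, top_n: int) -> List[List[str]]:
    """Build the commands for one crawl round as argument lists.

    `segment` is the directory that the generate step creates under
    <crawl_dir>/segments; in a real run you would look it up after
    generate has finished rather than pass it in up front.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    return [
        [NUTCH, "inject", crawldb, seed_dir],                        # seed URLs into the crawldb
        [NUTCH, "generate", crawldb, segments, "-topN", str(top_n)],  # pick URLs to fetch
        [NUTCH, "fetch", segment],                                   # download the pages
        [NUTCH, "parse", segment],                                   # parse the fetched content
        [NUTCH, "updatedb", crawldb, segment],                       # feed results back into the crawldb
        [NUTCH, "invertlinks", linkdb, "-dir", segments],            # build the linkdb
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in crawl_steps("crawl", "urls", "crawl/segments/20130424085700", 1000):
        print(" ".join(cmd))
```

Repeating the generate, fetch, parse and updatedb steps gives further rounds; indexing to Solr is a separate step covered in the tutorial linked above.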

