Hi Kamil,

> I was wondering if this script is advisable to use?
I haven't tried the script itself, but I have used some of the underlying commands (mergedb, etc.).

> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)

Of course, some of the commands are obsolete. A long time ago, Nutch used Lucene index shards directly. Nowadays the management of indexes (including the merging of shards) is delegated to Solr or Elasticsearch.

> I plan to use it for crawls of non-overlapping urls.

... just a few thoughts about this particular use case: why do you want to merge the data structures at all?
- if they're disjoint, there is no need for it
- all operations (CrawlDb: generate, update, etc.) are much faster on smaller structures

If merging is required: most of the Nutch jobs can read multiple segments or CrawlDbs. However, it might be that a command-line tool expects only a single CrawlDb or segment. In that case:
- we could extend the command-line params
- or just copy the sequence files into one single path

~Sebastian

On 2/2/23 01:54, Kamil Mroczek wrote:
> Hi,
>
> I am testing how merging crawls works and found this script:
> https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl. I was
> wondering if this script is advisable to use? I plan to use it for
> crawls of non-overlapping urls.
>
> I am wary of using it since it is located under "Archive & Legacy" on
> the wiki. But after running some tests it seems to function correctly.
> I only had to remove the merge command
> ($nutch_dir/nutch merge $index_dir $new_indexes) since that is not a
> command anymore.
>
> I am not necessarily looking for a list of potential issues (if the
> list is long), just trying to understand why it might be under the
> archive.
>
> Kamil
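P.S. A minimal sketch of the "copy the sequence files into one single path" option mentioned above. The directory names (crawldb1, crawldb2, merged) and the dummy part files are hypothetical, made up for illustration; a real CrawlDb holds Hadoop sequence files under current/part-r-NNNNN, and the supported tool for this job is `bin/nutch mergedb <output_crawldb> <crawldb> ...`:

```shell
# Work in a throwaway directory (hypothetical demo layout, not real crawl data).
workdir=$(mktemp -d)
cd "$workdir"

# Two disjoint CrawlDbs, each with the usual current/part-r-NNNNN layout.
mkdir -p crawldb1/current/part-r-00000 crawldb2/current/part-r-00000
echo "demo data 1" > crawldb1/current/part-r-00000/data
echo "demo data 2" > crawldb2/current/part-r-00000/data

# "Copy into one single path": collect the part directories under one
# CrawlDb, renaming them so the part numbers do not collide. Nutch jobs
# reading merged/current will then see all parts together.
mkdir -p merged/current
cp -r crawldb1/current/part-r-00000 merged/current/part-r-00000
cp -r crawldb2/current/part-r-00000 merged/current/part-r-00001

ls merged/current
```

The equivalent with the real tool would be along the lines of `bin/nutch mergedb merged crawldb1 crawldb2`, which also resolves duplicate URLs properly; plain copying only works because the crawls here are known to be disjoint.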