Hello,

For the last two weeks I have been trying to convert our project's VSS repository to SVN, and last Friday I finally succeeded in producing a Subversion repository that does not deviate in its final state from the VSS one and has reasonable coverage of the past history. Since the part of the repository that I tried to convert holds more than 10 years of data on about 30000 files, not counting branches, and is naturally a bit corrupted, it is impossible to ensure 100% correctness anyway.
As the patches I added to achieve that goal might be useful for others, I want to share them here. Of the 11 patches, 4 address memory usage issues on large repositories, 3 add enhancements to vss2svn, and 4 fix bugs. Their descriptions are as follows:

0001-Added-an-option-to-reuse-text-files-of-the-cache-for.patch

This patch adds an option which, when activated, makes vss2svn reuse temporary files left over from previous runs, thus avoiding costly calls to ssphys. It is useful during experimentation.

0002-Added-visual-progress-indication.patch

Adds progress indication to most of the steps.

0003-Reimplemented-LoadVssNames-to-use-XML-Parser-direct.patch

On our repository, loading the XML file via XML::Simple consumes about 1GB of memory. That is unacceptable, so this patch reimplements the step using XML::Parser. I'm not familiar with it, so I might have missed some difficult cases, e.g. handling of things like "&" in names.

0004-Use-arrays-for-table-data-storage-in-order-to-avoid.patch

Uses an array of arrays instead of an array of hashes to hold entries in MergeParentData, in order to save some memory.

0005-Use-a-sliding-window-limited-by-timestamp-range-to-c.patch

MergeUnpinPinData does not need to load the whole table into memory at once; it is enough to hold a timestamp-limited range of records.

0006-Do-not-fetch-the-table-into-memory-in-BuildComments.patch

BuildComments does not need to load the table at all; it is better to keep a list of records to update afterwards. Also fixes a bug in newline handling.

0007-Added-options-to-prune-orphaned-files-and-labels-fro.patch

Most of the orphans in this repository come from unneeded projects that were removed before conversion. This patch adds an option to automatically remove all orphans, except those absolutely necessary for correct conversion. Doing it at the stage of computing the set of actions guarantees a valid dump file. Without removing orphans, SVN consumes about 200KB per revision (apparently it cannot diff directories, and the /orphaned dir gets very large).

0008-Fixed-a-problem-in-chain-share-handling.patch

In some cases, when a file is shared into a path, branched, then immediately reshared to another path, and so on, vss2svn generates a command to copy from the current revision, which svnadmin does not accept. This quick fix avoids the problem by beginning a new revision on shares from a path that was touched by the current revision.

0009-Fixed-a-bug-in-handling-restore-with-labels.patch

Fixes a bug in the recent patch, which produces invalid paths if the last action in a restored project was adding a label. I'm not sure about this bit; I just thought it strange that we should completely reset the version numbers even when they are available:

@@ -474,13 +474,18 @@ sub _restore_handler {
         type => $row->{itemtype},
         name => $row->{itemname},
         parents => {},
-        first_version => 1,
-        last_version => 1,
+        first_version => $gPhysInfo{ $row->{physname} }{first_version} || 1,
+        last_version => $gPhysInfo{ $row->{physname} }{last_version} || 1,
         orphaned => 1,
         was_binary => $row->{is_binary},
     };

0010-Recover-files-with-corrupted-history-at-branch-point.patch

In the trunk version, if a physical file is lost, none of the files branched from it are converted, despite the fact that all versions after the branch point are available. This patch adds a check to re-add such files at the branch point.

0011-Support-rolling-back-during-branching.patch

Branching CAN be combined with a rollback, so it is not always a no-op in SVN.
This patch introduces a new ROLLBACK action to handle that case.

P.S. Is it by design that Subversion stores all 110000 revisions in one directory?

Alexander
From 4146a4b0734836de73007e49f3a0133917e3ed47 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Fri, 23 May 2008 00:30:52 +0400 Subject: [PATCH 01/11] Added an option to reuse text files of the cache for the first steps. It allows one to avoid expensive scanning of the VSS database files during multiple attempts at converting. This option also preserves extracted versions in vssdata for the same purpose. --- script/Vss2Svn/DataCache.pm | 15 ++++++++++----- script/vss2svn.pl | 36 +++++++++++++++++++++++++++--------- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/script/Vss2Svn/DataCache.pm b/script/Vss2Svn/DataCache.pm index 04ea6c6..e90ff0a 100644 --- a/script/Vss2Svn/DataCache.pm +++ b/script/Vss2Svn/DataCache.pm @@ -15,7 +15,7 @@ our(%gCfg); # new ############################################################################### sub new { - my($class, $table, $autoinc) = @_; + my($class, $table, $autoinc, %flags) = @_; my $self = { @@ -25,6 +25,7 @@ sub new { verbose => $gCfg{verbose}, fh => undef, file => "$gCfg{cachedir}/datacache.$table.tmp.txt", + reused => 0, }; $self = bless($self, $class); @@ -35,12 +36,16 @@ sub new { $self->_delete_table(); - if ((-e $self->{file}) && !(unlink($self->{file}))) { - print "\nERROR: Could not delete existing cache file '$self->{file}'\n"; - return undef; + if (-e $self->{file}) { + if (-f $self->{file} && $flags{-reuse_data}) { + $self->{reused} = 1; + } elsif (!(unlink($self->{file}))) { + print "\nERROR: Could not delete existing cache file '$self->{file}'\n"; + return undef; + } } - if ( !open($self->{fh}, ">$self->{file}") ) { + if ( !open($self->{fh}, ">>$self->{file}") ) { print "\nERROR: Could not open file '$self->{file}'\n"; return undef; } diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 2b758eb..c5dfa19 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -119,6 +119,14 @@ sub RunConversion { # LoadVssNames ############################################################################### sub LoadVssNames { + my $cache = Vss2Svn::DataCache->new('NameLookup', 0, -reuse_data => $gCfg{reuse_cache}) + || &ThrowError("Could not create cache 'NameLookup'"); + + if ($cache->{reused}) { + $cache->commit(); + return; + } + &DoSsCmd("info -e$gCfg{encoding} \"$gCfg{vssdatadir}/names.dat\""); my $xs = XML::Simple->new(KeyAttr => [], @@ -130,9 +138,6 @@ sub LoadVssNames { my($entry, $count, $offset, $name); - my $cache = Vss2Svn::DataCache->new('NameLookup') - || &ThrowError("Could not create cache 'NameLookup'"); - ENTRY: foreach $entry (@$namesref) { $count = $entry->{NrOfEntries}; @@ -161,10 +166,14 @@ ENTRY: # FindPhysDbFiles ############################################################################### sub FindPhysDbFiles { - - my $cache = Vss2Svn::DataCache->new('Physical') + my $cache = Vss2Svn::DataCache->new('Physical', 0, -reuse_data => $gCfg{reuse_cache}) || &ThrowError("Could not create cache 'Physical'"); + if ($cache->{reused}) { + $cache->commit(); + return; + } + find(sub{ &FoundSsFile($cache) }, $gCfg{vssdatadir}); $cache->commit(); @@ -195,10 +204,15 @@ sub GetPhysVssHistory { my($sql, $sth, $row, $physname, $physdir); &LoadNameLookup; - my $cache = Vss2Svn::DataCache->new('PhysicalAction', 1) + my $cache = Vss2Svn::DataCache->new('PhysicalAction', 1, -reuse_data => $gCfg{reuse_cache}) || &ThrowError("Could not create cache 'PhysicalAction'"); - $sql = "SELECT * FROM Physical"; + if ($cache->{reused}) { + $cache->commit(); + return; + } + + $sql = "SELECT * FROM Physical ORDER BY 
physname"; $sth = $gCfg{dbh}->prepare($sql); $sth->execute(); @@ -1974,7 +1988,7 @@ FIELD: sub Initialize { $| = 1; - GetOptions(\%gCfg,'vssdir=s','tempdir=s','dumpfile=s','resume','verbose', + GetOptions(\%gCfg,'vssdir=s','tempdir=s','dumpfile=s','resume','verbose','reuse_cache','prompt', 'debug','timing+','task=s','revtimerange=i','ssphys=s', 'encoding=s','trunkdir=s','auto_props=s', 'label_mapper=s', 'md5'); @@ -2040,7 +2054,7 @@ sub Initialize { return 1; } - rmtree($gCfg{vssdata}) if (-e $gCfg{vssdata}); + rmtree($gCfg{vssdata}) if (-e $gCfg{vssdata} && !$gCfg{reuse_cache}); mkdir $gCfg{vssdata}; $gCfg{ssphys} ||= 'ssphys'; @@ -2148,6 +2162,10 @@ OPTIONAL PARAMETERS: INIT, LOADVSSNAMES, FINDDBFILES, GETPHYSHIST, MERGEPARENTDATA, MERGEMOVEDATA, REMOVETMPCHECKIN, MERGEUNPINPIN, BUILDACTIONHIST, IMPORTSVN + --reuse_cache : Rebuild the database, but reuse text temporary files. + May be useful if the database becomes corrupt due to + an unexpected power failure. Make sure to remove any + incomplete files beforehand. --verbose : Print more info about the items being processed --debug : Print lots of debugging info. -- 1.5.3.3
From eb89ba04e86f8ec5aaf2514e157ed1d4bee86aec Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Sat, 17 May 2008 11:36:47 +0400 Subject: [PATCH 03/11] Reimplemented LoadVssNames to use XML::Parser directly. Otherwise, it consumes too much memory on huge repositories. --- script/vss2svn.pl | 62 ++++++++++++++++++++++++++++++---------------------- 1 files changed, 36 insertions(+), 26 deletions(-) diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 0bd7cd4..14bed29 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -7,6 +7,7 @@ use Getopt::Long; use DBI; use DBD::SQLite2; use XML::Simple; +use XML::Parser; use File::Find; use File::Path; use Time::CTime; @@ -152,35 +153,44 @@ sub LoadVssNames { &DoSsCmd("info -e$gCfg{encoding} \"$gCfg{vssdatadir}/names.dat\""); - my $xs = XML::Simple->new(KeyAttr => [], - ForceArray => [qw(NameCacheEntry Entry)],); - - my $xml = $xs->XMLin($gSysOut); - - my $namesref = $xml->{NameCacheEntry} || return 1; - my($entry, $count, $offset, $name); -ENTRY: - foreach $entry (@$namesref) { - $count = $entry->{NrOfEntries}; - $offset = $entry->{offset}; - - # The cache can contain 4 different entries: - # id=1: abbreviated DOS 8.3 name for file items - # id=2: full name for file items - # id=3: abbreviated 27.3 name for file items - # id=10: full name for project items - # Both ids 1 and 3 are not of any interest for us, since they only - # provide abbreviated names for different szenarios. We are only - # interested if we have id=2 for file items, or id=10 for project - # items. - foreach $name (@{$entry->{Entry}}) { - if ($name->{id} == 10 || $name->{id} == 2) { - $cache->add($offset, $name->{content}); + my $parser = new XML::Parser(Handlers => { + Start => sub { + my ($exp, $tag, %attrs) = @_; + + if ($tag eq 'NameCacheEntry') { + $offset = $attrs{offset}; + } elsif ($tag eq 'Entry') { + $entry = $attrs{id}; + $name = ''; } - } - } + }, + Char => sub { + my ($exp, $str) = @_; + $name .= $str; + }, + End => sub { + my ($exp, $tag) = @_; + + if ($tag eq 'Entry') { + # The cache can contain 4 different entries: + # id=1: abbreviated DOS 8.3 name for file items + # id=2: full name for file items + # id=3: abbreviated 27.3 name for file items + # id=10: full name for project items + # Both ids 1 and 3 are not of any interest for us, since they only + # provide abbreviated names for different szenarios. We are only + # interested if we have id=2 for file items, or id=10 for project + # items. + if ($entry == 10 || $entry == 2) { + $cache->add($offset, $name); + } + } + }, + }); + + $parser->parse($gSysOut); $cache->commit(); } # End LoadVssNames -- 1.5.3.3
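About the "&" worry: expat may deliver the text of a single element in several Char callbacks (for example around entity references such as &amp;), which is why the handler accumulates into $name instead of assigning to it. Here is a tiny standalone example, independent of vss2svn and with made-up input, using the same three handlers:

    use strict;
    use warnings;
    use XML::Parser;

    my ($offset, $id, $name);
    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my (undef, $tag, %attrs) = @_;
            if    ($tag eq 'NameCacheEntry') { $offset = $attrs{offset}; }
            elsif ($tag eq 'Entry')          { $id = $attrs{id}; $name = ''; }
        },
        Char => sub { $name .= $_[1]; },   # may be called several times per <Entry>
        End  => sub {
            my (undef, $tag) = @_;
            print "$offset: $name\n" if $tag eq 'Entry' && ($id == 2 || $id == 10);
        },
    });

    $parser->parse('<names><NameCacheEntry offset="42">'
                 . '<Entry id="2">A &amp; B.txt</Entry>'
                 . '</NameCacheEntry></names>');   # prints "42: A & B.txt"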
From b69bd4c602f3ec109551decd22b9fed9996551db Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Sat, 17 May 2008 12:14:51 +0400 Subject: [PATCH 06/11] Do not fetch the table into memory in BuildComments. Also fixed a bug, that improperly removed newlines, thus merging multiple comments on one line. --- script/vss2svn.pl | 29 ++++++++++++++++++++--------- 1 files changed, 20 insertions(+), 9 deletions(-) diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 7dac544..e9e77f9 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -1056,12 +1056,11 @@ sub BuildComments { $sth = $gCfg{dbh}->prepare($sql); $sth->execute(); - # need to pull in all recs at once, since we'll be updating/deleting data - $rows = $sth->fetchall_arrayref( {} ); + my @updchild = (); init_progress 'Processing', $total_count; - foreach $row (@$rows) { + while ($row = $sth->fetchrow_hashref()) { advance if ($progress++ % 1000) == 0; # technically we have the following situations: @@ -1135,21 +1134,33 @@ sub BuildComments { print "\n" if $gCfg{verbose}; foreach my $c(@$comments) { + next unless $c->{comment}; + print " $c->{version}: $c->{comment}\n" if $gCfg{verbose}; $comment .= $c->{comment} . "\n"; - $comment =~ s/^\n+//; - $comment =~ s/\n+$//; + $comment =~ s/\n+$/\n/; } if (defined $comment && !defined $row->{comment}) { + $comment =~ s/^\n+//; + $comment =~ s/\n+$//; + $comment = $prefix . $comment if defined $prefix; - $comment =~ s/"/""/g; - my $sql3 = 'UPDATE PhysicalAction SET comment="' . $comment . '" WHERE action_id = ' . $row->{action_id}; - my $sth3 = $gCfg{dbh}->prepare($sql3); - $sth3->execute(); + push @updchild, [$comment, $row->{action_id}]; } } + init_progress 'Updating', @updchild; + + my $sql3 = 'UPDATE PhysicalAction SET comment = ? WHERE action_id = ?'; + my $sth3 = $gCfg{dbh}->prepare($sql3); + + foreach my $item (@updchild) { + advance if ($progress++ % 1000) == 0; + + $sth3->execute(@$item); + } + end_progress; 1; -- 1.5.3.3
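The general pattern, in case someone wants to apply the same treatment to other steps: stream the rows instead of calling fetchall_arrayref, collect the pending writes, and run them afterwards through one prepared statement with placeholders (which also gets rid of the hand-rolled quote doubling of the old code). A rough sketch, where build_comment_for is a made-up placeholder for the real comment assembly:

    my $sth = $gCfg{dbh}->prepare(
        'SELECT * FROM PhysicalAction WHERE actiontype="PIN" AND itemtype=2 ORDER BY physname ASC');
    $sth->execute();

    my @updates;
    while (my $row = $sth->fetchrow_hashref()) {
        my $comment = build_comment_for($row);              # hypothetical helper
        push @updates, [ $comment, $row->{action_id} ]
            if defined $comment && !defined $row->{comment};
    }

    my $upd = $gCfg{dbh}->prepare(
        'UPDATE PhysicalAction SET comment = ? WHERE action_id = ?');
    $upd->execute(@$_) for @updates;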
From ba24f548a75b888ecede6c41ba495818c36fa3f3 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Fri, 23 May 2008 00:30:15 +0400 Subject: [PATCH 05/11] Use a sliding window limited by timestamp range to conserve memory in MergeUnpinPinData. --- script/vss2svn.pl | 58 ++++++++++++++++++++++++++++++++++++++++------------ 1 files changed, 44 insertions(+), 14 deletions(-) diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 980074b..7dac544 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -133,7 +133,9 @@ sub RunConversion { die if $temp =~ m/^quit/i; } + $gCfg{dbh}->begin_work(); &{ $info->{handler} }; + $gCfg{dbh}->commit(); &SetSystemTask( $info->{next} ); } @@ -945,6 +947,16 @@ sub RemoveTemporaryCheckIns { 1; } + +############################################################################### +# Sliding window support +############################################################################### +sub fetch_next_row([EMAIL PROTECTED]) { + my $row = $_[1]->fetchrow_hashref(); + push @{$_[0]}, $row if $row; + return $row; +} + ############################################################################### # MergeUnpinPinData ############################################################################### @@ -958,28 +970,31 @@ sub MergeUnpinPinData { $sth = $gCfg{dbh}->prepare($sql); $sth->execute(); - # need to pull in all recs at once, since we'll be updating/deleting data - $rows = $sth->fetchall_arrayref( {} ); - - return if ($rows == -1); - return if (@$rows < 2); + $rows = []; + fetch_next_row @$rows, $sth or return; + fetch_next_row @$rows, $sth or return; my @delchild = (); + my @updchild = (); init_progress 'Processing', $total_count; - for $r (0 .. @$rows-2) { - $row = $rows->[$r]; - + while (@$rows > 1) { + $row = shift @$rows; + advance if ($progress++ % 1000) == 0; if ($row->{actiontype} eq 'PIN' && !defined $row->{version}) # UNPIN { # Search for a matching pin action - my $u; - for ($u = $r+1; $u <= @$rows-2; $u++) { + my $u = 0; + while ($u <= $#$rows) { $next_row = $rows->[$u]; + # Bail out if the actions cannot get in one commit + # to avoid quadratic performance and preserve limited window size + last if ($next_row->{timestamp} - $row->{timestamp}) > ($gCfg{revtimerange}||3600); + if ( $next_row->{actiontype} eq 'PIN' && defined $next_row->{version} # PIN && $row->{physname} eq $next_row->{physname} @@ -993,20 +1008,35 @@ sub MergeUnpinPinData { # if we have a unpinFromVersion number copy this one to the PIN handler if (defined $row->{info}) { - my $sql2 = "UPDATE PhysicalAction SET info = ? WHERE action_id = ?"; - my $sth2 = $gCfg{dbh}->prepare($sql2); - $sth2->execute($row->{info}, $next_row->{action_id}); + push (@updchild, [$row->{info}, $next_row->{action_id}]); } push (@delchild, $row->{action_id}); + last; } # if the next action is anything else than a pin stop the search - $u = @$rows if ($next_row->{actiontype} ne 'PIN' ); + last if ($next_row->{actiontype} ne 'PIN' ); + } + continue { + fetch_next_row @$rows, $sth unless ++$u <= $#$rows; } } + } continue { + fetch_next_row @$rows, $sth unless @$rows > 1; } + init_progress 'Updating', scalar(@updchild); + + my $sql2 = "UPDATE PhysicalAction SET info = ? WHERE action_id = ?"; + my $sth2 = $gCfg{dbh}->prepare($sql2); + + foreach my $item (@updchild) { + advance if ($progress++ % 1000) == 0; + + $sth2->execute(@$item); + } + &DeleteChildRecList([EMAIL PROTECTED]); end_progress; -- 1.5.3.3
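Stripped of the vss2svn specifics, the sliding-window idea is simply this (the names below are illustrative, not the patch's code): keep a small buffer of rows, refill it lazily from the statement handle, and stop looking ahead once the next candidate falls outside the revision time range, so the buffer never grows beyond one time window.

    sub fetch_next_row {
        my ($window, $sth) = @_;
        my $row = $sth->fetchrow_hashref();
        push @$window, $row if $row;
        return $row;
    }

    sub scan_with_window {
        my ($sth, $range) = @_;        # $sth: an executed DBI statement handle
        my @window;
        fetch_next_row(\@window, $sth);
        while (my $row = shift @window) {
            my $i = 0;
            while ($i <= $#window or fetch_next_row(\@window, $sth)) {
                my $next = $window[$i++];
                last if $next->{timestamp} - $row->{timestamp} > $range;
                # ... try to pair $row with $next here (e.g. UNPIN with PIN) ...
            }
        }
    }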
From b018dd3b5ca9553f3a6b902282e469b4f22a3933 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Wed, 21 May 2008 10:36:33 +0400 Subject: [PATCH 07/11] Added options to prune orphaned files and labels from the dump. In case when the repository is prepared for conversion by destroying unnecessary branches, they tend to end up as orphans, as VSS 8 seems to be reluctant in actually deleting data files. It may take many runs of Analyze to get rid of them. Thus, it is useful to be able to prune them during conversion. Yet, sometimes orphans are moved into non-orphaned space, for instance when they are produced by recovering from an archive. This patch runs an additional pass of action processing to collect data about moves. Note also that a very large number of orphans causes SVN to consume about 200K per revision, as it seems to store a complete list of files in one folder, if even just one of them changes, (and all orphans are put into the /orphaned folder). --- script/Vss2Svn/ActionHandler.pm | 4 + script/vss2svn.pl | 153 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 150 insertions(+), 7 deletions(-) diff --git a/script/Vss2Svn/ActionHandler.pm b/script/Vss2Svn/ActionHandler.pm index 02d2735..af96c5e 100644 --- a/script/Vss2Svn/ActionHandler.pm +++ b/script/Vss2Svn/ActionHandler.pm @@ -22,6 +22,10 @@ our %gHandlers = our(%gPhysInfo); our(%gOrphanedInfo); +sub ResetState() { + %gPhysInfo = %gOrphanedInfo = (); +} + ############################################################################### # new ############################################################################### diff --git a/script/vss2svn.pl b/script/vss2svn.pl index e9e77f9..a77609c 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -1196,9 +1196,109 @@ sub DeleteChildRecList { } # End DeleteChildRecList ############################################################################### +# FindForcedOrphans +############################################################################### +sub FindForcedOrphans { + my ($forced_orphans, $total_count) = @_; + + my $sql = 'SELECT * FROM PhysicalAction ORDER BY timestamp ASC, ' + . 
'itemtype ASC, priority ASC, parentdata ASC, sortkey ASC, action_id ASC'; + + my $sth = $gCfg{dbh}->prepare($sql); + + $sth->execute(); + init_progress 'Collecting orphaned moves', $total_count; + + my @moves; + + ROW: + while(my $row = $sth->fetchrow_hashref()) { + advance if ($progress++ % 1000) == 0; + + my $action = $row->{actiontype}; + + my $handler = Vss2Svn::ActionHandler->new($row); + $handler->{verbose} = $gCfg{verbose}; + $handler->{trunkdir} = $gCfg{trunkdir}; + my $physinfo = $handler->physinfo(); + + if (defined($physinfo) && $physinfo->{type} != $row->{itemtype} ) { + next ROW; + } + + $row->{itemname} = Encode::decode_utf8( $row->{itemname} ); + $row->{info} = Encode::decode_utf8( $row->{info} ); + $row->{comment} = Encode::decode_utf8( $row->{comment} ); + $row->{author} = Encode::decode_utf8( $row->{author} ); + $row->{label} = Encode::decode_utf8( $row->{label} ); + + if (!$handler->handle($action)) { + next ROW; + } + + next ROW unless $handler->{action} eq 'MOVE'; + + my $itempaths = $handler->{itempaths}; + my $info = $handler->{info}; + + my $src_id; + if ($info =~ /^\/orphaned\/_([A-Z]{8})/) { + $src_id = $1; + } + + my $tgt_id; + if ($itempaths->[0] =~ /^\/orphaned\/_([A-Z]{8})/) { + $tgt_id = $1; + } + + if ($src_id || $tgt_id) { + push @moves, [ $src_id, $info, $tgt_id, $itempaths->[0] ]; + } + } + + end_progress; + + Vss2Svn::ActionHandler::ResetState(); + + my @queue; + for my $move (@moves) { + print "Moved orphans $move->[1] to $move->[3]\n"; + + push @queue, $move->[0] + if defined $move->[0] && !defined $move->[2]; + } + + while (@queue) { + my $id = shift @queue; + next if $forced_orphans->{$id}; + + print "Forcing inclusion of $id\n"; + $forced_orphans->{$id} = 1; + + for my $move (@moves) { + push @queue, $move->[0] + if defined $move->[0] && defined $move->[2] && $move->[2] eq $id; + } + } +} + +############################################################################### # BuildVssActionHistory ############################################################################### sub BuildVssActionHistory { + my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM PhysicalAction'); + + my %forced_orphans; + if ($gCfg{no_orphaned}) { + FindForcedOrphans(\%forced_orphans, $total_count); + + if ($gCfg{prompt}) { + print "Press ENTER to continue...\n"; + my $temp = <STDIN>; + die if $temp =~ m/^quit/i; + } + } + my $vsscache = Vss2Svn::DataCache->new('VssAction', 1) || &ThrowError("Could not create cache 'VssAction'"); @@ -1218,8 +1318,6 @@ sub BuildVssActionHistory { my($sth, $row, $action, $handler, $physinfo, $itempaths, $allitempaths); - my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM PhysicalAction'); - my $sql = 'SELECT * FROM PhysicalAction ORDER BY timestamp ASC, ' . 'itemtype ASC, priority ASC, parentdata ASC, sortkey ASC, action_id ASC'; @@ -1288,10 +1386,6 @@ ROW: } } - # we need to check for the next rev number, after all pathes that can - # prematurally call the next row. Otherwise, we get an empty revision. 
- $svnrevs->check($row); - # May contain add'l info for the action depending on type: # RENAME: the new name (without path) # SHARE: the source path which was shared @@ -1300,6 +1394,46 @@ ROW: # LABEL: the name of the label $row->{info} = $handler->{info}; + # Drop labels if requested + next ROW if $row->{actiontype} eq 'LABEL' && $gCfg{no_labels}; + + # Drop orphaned files if requested + if ($gCfg{no_orphaned} && @$itempaths) { + $itempaths = [ grep { !($_ && /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) } @$itempaths ]; + + if ($row->{actiontype} =~ /^(ADD|COMMIT|RENAME|BRANCH|DELETE|RECOVER|LABEL)$/) { + ; + } elsif ($row->{actiontype} eq 'SHARE') { + if ($row->{info} && $row->{info} =~ /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) { + $row->{actiontype} = 'ADD'; + undef $row->{info}; + } + } elsif ($row->{actiontype} eq 'PIN') { + if ($row->{info} && $row->{info} =~ /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) { + $row->{actiontype} = 'COMMIT'; + undef $row->{info}; + } + } elsif ($row->{actiontype} eq 'MOVE') { + if ($row->{info} && $row->{info} =~ /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) { + print "WARNING: Converting orphaned MOVE into ADD - possible data loss.\n" if @$itempaths; + $row->{actiontype} = 'ADD'; + undef $row->{info}; + } elsif ([EMAIL PROTECTED]) { + $row->{actiontype} = 'DELETE'; + $itempaths = [ $row->{info} ]; + undef $row->{info}; + } + } else { + die "Unknown action type $row->{actiontype}"; + } + + next ROW unless @$itempaths; + } + + # we need to check for the next rev number, after all pathes that can + # prematurally call the next row. Otherwise, we get an empty revision. + $svnrevs->check($row); + # The version may have changed if (defined $handler->{version}) { $row->{version} = $handler->{version}; @@ -1320,6 +1454,8 @@ ROW: end_progress; + Vss2Svn::ActionHandler::ResetState(); + $vsscache->commit(); $svnrevs->commit(); $joincache->commit(); @@ -2139,7 +2275,8 @@ FIELD: sub Initialize { $| = 1; - GetOptions(\%gCfg,'vssdir=s','tempdir=s','dumpfile=s','resume','verbose','reuse_cache','prompt', + GetOptions(\%gCfg,'vssdir=s','tempdir=s','dumpfile=s','resume','verbose', + 'reuse_cache','prompt','no_orphaned','no_labels', 'debug','timing+','task=s','revtimerange=i','ssphys=s', 'encoding=s','trunkdir=s','auto_props=s', 'label_mapper=s', 'md5'); @@ -2329,6 +2466,8 @@ OPTIONAL PARAMETERS: --auto_props="c:/Dokumente und Einstellungen/user/Anwendungsdaten/Subversion/config" --md5 : generate md5 checksums --label_mapper : INI style file to map labels to different locataions + --no_orphaned : Do not generate the orphaned cache + --no_labels : Do not convert labels EOTXT exit(1); -- 1.5.3.3
From 99a7b67e5d9bc0ac1de89359c0a3de3fc67c7c77 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Sun, 18 May 2008 13:31:10 +0400 Subject: [PATCH 02/11] Added visual progress indication. Allows one to estimate time, necessary for completion of the current step. --- script/vss2svn.pl | 92 ++++++++++++++++++++++++++++++++++++++++++++++++----- 1 files changed, 84 insertions(+), 8 deletions(-) diff --git a/script/vss2svn.pl b/script/vss2svn.pl index c5dfa19..0bd7cd4 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -38,6 +38,29 @@ $VERSION =~ s/\$.*?(\d+).*\$/$1/; # get only the number out of the svn revision &DisconnectDatabase; ############################################################################### +# Progress tracking +############################################################################### + +our ($progress, $total, $progress_title); + +sub init_progress($$) { + $progress = 0; + $progress_title = $_[0]; + $total = $_[1] + 1; + print "\r$progress_title: 0% (0) "; +} + +sub advance(;$) { + my $m = $_[0]||''; + print "\r$progress_title: ".int(100*$progress/$total)."% ($progress) $m "; +} + +sub end_progress() { + advance '(done)'; + print "\n"; +} + +############################################################################### # RunConversion ############################################################################### sub RunConversion { @@ -218,14 +241,19 @@ sub GetPhysVssHistory { my $xs = XML::Simple->new(ForceArray => [qw(Version)]); + $progress = 0; while (defined($row = $sth->fetchrow_hashref() )) { $physname = $row->{physname}; + + print "\r${physname}... " if !$gCfg{debug} && ($progress++ % 1000) == 0; $physdir = "$gCfg{vssdir}/data"; my $physfolder = substr($physname, 0, 1); &GetVssPhysInfo($cache, $physdir, $physfolder, $physname, $xs); } + + print "\rCommitting... \n" unless $gCfg{debug}; $cache->commit(); @@ -564,7 +592,11 @@ sub MergeParentData { my($childrecs, $child, $id, $depth); my @delchild = (); + init_progress 'Processing', scalar(@$rows); + foreach $row (@$rows) { + advance if ($progress++ % 1000) == 0; + $childrecs = &GetChildRecs($row); if (scalar @$childrecs > 1) { @@ -580,10 +612,10 @@ sub MergeParentData { } } - foreach $id (@delchild) { - &DeleteChildRec($id); - } + &DeleteChildRecList([EMAIL PROTECTED]); + end_progress; + 1; } # End MergeParentData @@ -884,6 +916,9 @@ sub RemoveTemporaryCheckIns { ############################################################################### sub MergeUnpinPinData { my($sth, $rows, $row, $r, $next_row); + + my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM PhysicalAction'); + my $sql = 'SELECT * FROM PhysicalAction ORDER BY timestamp ASC, ' . 'itemtype ASC, priority ASC, parentdata ASC, sortkey ASC, action_id ASC'; $sth = $gCfg{dbh}->prepare($sql); @@ -896,10 +931,14 @@ sub MergeUnpinPinData { return if (@$rows < 2); my @delchild = (); + + init_progress 'Processing', $total_count; for $r (0 .. 
@$rows-2) { $row = $rows->[$r]; + advance if ($progress++ % 1000) == 0; + if ($row->{actiontype} eq 'PIN' && !defined $row->{version}) # UNPIN { # Search for a matching pin action @@ -934,11 +973,9 @@ sub MergeUnpinPinData { } } - my $id; - foreach $id (@delchild) { - &DeleteChildRec($id); - } + &DeleteChildRecList([EMAIL PROTECTED]); + end_progress; 1; } # End MergeUnpinPinData @@ -948,6 +985,9 @@ sub MergeUnpinPinData { ############################################################################### sub BuildComments { my($sth, $rows, $row, $r, $next_row); + + my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM PhysicalAction'); + my $sql = 'SELECT * FROM PhysicalAction WHERE actiontype="PIN" AND itemtype=2 ORDER BY physname ASC'; $sth = $gCfg{dbh}->prepare($sql); $sth->execute(); @@ -955,7 +995,10 @@ sub BuildComments { # need to pull in all recs at once, since we'll be updating/deleting data $rows = $sth->fetchall_arrayref( {} ); + init_progress 'Processing', $total_count; + foreach $row (@$rows) { + advance if ($progress++ % 1000) == 0; # technically we have the following situations: # PIN only: we come from the younger version and PIN to a older one: the @@ -1042,6 +1085,8 @@ sub BuildComments { $sth3->execute(); } } + + end_progress; 1; } # End BuildComments @@ -1059,6 +1104,23 @@ sub DeleteChildRec { } # End DeleteChildRec ############################################################################### +# DeleteChildRecList +############################################################################### +sub DeleteChildRecList { + my($idlst) = @_; + + my $sql = "DELETE FROM PhysicalAction WHERE action_id = ?"; + my $sth = $gCfg{dbh}->prepare($sql); + + init_progress 'Deleting', scalar(@$idlst); + + for my $id (@$idlst) { + advance if ($progress++ % 1000) == 0; + $sth->execute($id); + } +} # End DeleteChildRecList + +############################################################################### # BuildVssActionHistory ############################################################################### sub BuildVssActionHistory { @@ -1081,14 +1143,20 @@ sub BuildVssActionHistory { my($sth, $row, $action, $handler, $physinfo, $itempaths, $allitempaths); + my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM PhysicalAction'); + my $sql = 'SELECT * FROM PhysicalAction ORDER BY timestamp ASC, ' . 
'itemtype ASC, priority ASC, parentdata ASC, sortkey ASC, action_id ASC'; $sth = $gCfg{dbh}->prepare($sql); $sth->execute(); + init_progress 'Processing', $total_count; + ROW: while(defined($row = $sth->fetchrow_hashref() )) { + advance if ($progress++ % 1000) == 0; + $action = $row->{actiontype}; $handler = Vss2Svn::ActionHandler->new($row); @@ -1175,6 +1243,8 @@ ROW: } + end_progress; + $vsscache->commit(); $svnrevs->commit(); $joincache->commit(); @@ -1207,6 +1277,7 @@ sub CreateSvnDumpfile { my($sql, $sth, $action_sth, $row, $revision, $actions, $action, $physname, $itemtype); my %exported = (); + my $total_count = $gCfg{dbh}->selectrow_array('SELECT COUNT(*) FROM SvnRevisionVssAction'); $sql = 'SELECT * FROM SvnRevision ORDER BY revision_id ASC'; @@ -1228,9 +1299,10 @@ EOSQL my $dumpfile = Vss2Svn::Dumpfile->new($fh, $autoprops, $gCfg{md5}, $labelmapper); Vss2Svn::Dumpfile->SetTempDir($gCfg{tempdir}); + init_progress 'Processing', $total_count; + REVISION: while(defined($row = $sth->fetchrow_hashref() )) { - my $t0 = new Benchmark; $revision = $row->{revision_id}; @@ -1243,6 +1315,8 @@ REVISION: ACTION: foreach $action(@$actions) { + advance if ($progress++ % 200) == 0; + $physname = $action->{physname}; $itemtype = $action->{itemtype}; @@ -1272,6 +1346,8 @@ ACTION: if $gCfg{timing}; } + end_progress; + my @err = @{ $dumpfile->{errors} }; if (scalar @err > 0) { -- 1.5.3.3
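The mechanism is nothing more than carriage-return overprinting; a self-contained sketch of the same style (names and counts are made up):

    use strict;
    use warnings;

    $| = 1;                                   # vss2svn.pl already unbuffers STDOUT
    my $total = 5000;
    for my $done (0 .. $total) {
        printf "\rProcessing: %d%% (%d)   ", 100 * $done / $total, $done
            if $done % 1000 == 0;
        # ... real work for one record goes here ...
    }
    print "\rProcessing: 100% ($total) (done)\n";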
From 2cffa0de4ba78eea67224a565ff4785ccffd42e1 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Sat, 17 May 2008 11:45:38 +0400 Subject: [PATCH 04/11] Use arrays for table data storage, in order to avoid overusing memory in MergeParentData. --- script/vss2svn.pl | 30 +++++++++++++++++++++++++++--- 1 files changed, 27 insertions(+), 3 deletions(-) diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 14bed29..980074b 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -577,6 +577,29 @@ sub LoadNameLookup { } } # End LoadNameLookup +############################################################################## +# Support for using array representation of PhysicalActions +############################################################################### + +BEGIN { + our @phys_sql_fields = qw( + action_id physname version parentphys actiontype + itemname itemtype timestamp author is_binary + info priority sortkey parentdata label comment + ); + our %phys_sql_map = map { 'PA_'.$phys_sql_fields[$_] => $_ } 0..$#phys_sql_fields; + our $phys_sql_fieldspec = join(',',@phys_sql_fields); +} + +our (@phys_sql_fields, %phys_sql_map, $phys_sql_fieldspec); +use constant \%phys_sql_map; + +# Expand an array into a hash +sub expand_arr { + my $arr = shift @_; + return { map { $_[$_] => $arr->[$_] } (0..$#_) }; +} + ############################################################################### # MergeParentData ############################################################################### @@ -592,19 +615,20 @@ sub MergeParentData { # then delete the separate child objects to avoid duplication. my($sth, $rows, $row); - $sth = $gCfg{dbh}->prepare('SELECT * FROM PhysicalAction ' + $sth = $gCfg{dbh}->prepare('SELECT '.$phys_sql_fieldspec.' FROM PhysicalAction ' . 'WHERE parentdata > 0'); $sth->execute(); # need to pull in all recs at once, since we'll be updating/deleting data - $rows = $sth->fetchall_arrayref( {} ); + $rows = $sth->fetchall_arrayref(); my($childrecs, $child, $id, $depth); my @delchild = (); init_progress 'Processing', scalar(@$rows); - foreach $row (@$rows) { + foreach my $arow (@$rows) { + $row = expand_arr $arow, @phys_sql_fields; advance if ($progress++ % 1000) == 0; $childrecs = &GetChildRecs($row); -- 1.5.3.3
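If anyone wants to see the difference on their own machine, a toy comparison of the two layouts looks like this. Note that Devel::Size is not used by vss2svn and is pulled in here only to make the difference measurable; the exact numbers depend on the perl build.

    use strict;
    use warnings;
    use Devel::Size qw(total_size);
    use constant PA_PHYSNAME => 1;            # index into the field list below

    my @fields = qw(action_id physname version actiontype timestamp);
    my (@as_hash, @as_array);
    for (1 .. 10_000) {
        push @as_hash,  { map { $_ => 'x' } @fields };   # one hash per row
        push @as_array, [ ('x') x @fields ];             # one array per row
    }

    printf "hashes: %d bytes, arrays: %d bytes\n",
           total_size(\@as_hash), total_size(\@as_array);

    print $as_array[0][PA_PHYSNAME], "\n";    # field access via a constant, as in the patch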
From aaee5c63b88cb88efca98999eaabb8470641eaaa Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Tue, 20 May 2008 20:28:22 +0400 Subject: [PATCH 08/11] Fixed a problem in chain share handling i.e. when A -> B, then B -> C etc --- script/Vss2Svn/SvnRevHandler.pm | 10 +++++++--- script/vss2svn.pl | 2 +- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/script/Vss2Svn/SvnRevHandler.pm b/script/Vss2Svn/SvnRevHandler.pm index b9e5cc6..fb23c41 100644 --- a/script/Vss2Svn/SvnRevHandler.pm +++ b/script/Vss2Svn/SvnRevHandler.pm @@ -47,6 +47,7 @@ sub _init { $self->{comment} = undef; $self->{lastcommentaction} = undef; $self->{seen} = {}; + $self->{seen_paths} = {}; $self->{last_action} = {}; } # End _init @@ -55,7 +56,7 @@ sub _init { # check ############################################################################### sub check { - my($self, $data) = @_; + my($self, $data, $itempaths) = @_; my($physname, $itemtype, $actiontype, $timestamp, $author, $comment) = @{ $data }{qw( physname itemtype actiontype timestamp author comment )}; @@ -72,7 +73,9 @@ sub check { my $last_action = $self->{last_action}->{$physname}; # in case the current action is the same as the last action - if ($actiontype eq 'SHARE' && $wasseen && $last_action eq $actiontype) { + # (and this means that we are not resharing from a newly-shared path) + if ($actiontype eq 'SHARE' && $wasseen && $last_action eq $actiontype + && !$self->{seen_paths}{$data->{info}||''}) { $wasseen = 0; } @@ -109,7 +112,8 @@ sub check { $self->{commitPending} = ($itemtype == 1 && $actiontype ne 'ADD') || ($self->{revnum} == 0); $self->{seen}->{$physname}++; - $self->{last_action}->{$physname} = $actiontype;; + $self->{seen_paths}->{$_}++ for grep { defined $_ } @$itempaths; + $self->{last_action}->{$physname} = $actiontype; @{ $self }{qw( timestamp author comment actiontype)} = ($timestamp, $author, $comment, $actiontype); diff --git a/script/vss2svn.pl b/script/vss2svn.pl index a77609c..04a22bf 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -1432,7 +1432,7 @@ ROW: # we need to check for the next rev number, after all pathes that can # prematurally call the next row. Otherwise, we get an empty revision. - $svnrevs->check($row); + $svnrevs->check($row, $itempaths); # The version may have changed if (defined $handler->{version}) { -- 1.5.3.3
From ee126ea04480c27760f1c68d8728eb0469c011ee Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Wed, 21 May 2008 20:35:32 +0400 Subject: [PATCH 10/11] Recover files with corrupted history at branch points. --- script/Vss2Svn/ActionHandler.pm | 19 +++++++++++++------ 1 files changed, 13 insertions(+), 6 deletions(-) diff --git a/script/Vss2Svn/ActionHandler.pm b/script/Vss2Svn/ActionHandler.pm index ff1c332..6730e83 100644 --- a/script/Vss2Svn/ActionHandler.pm +++ b/script/Vss2Svn/ActionHandler.pm @@ -317,16 +317,23 @@ sub _branch_handler { my $oldphysname = $row->{info}; my $oldphysinfo = $gPhysInfo{$oldphysname}; + + my $version = defined $row->{version} ? $row->{version} + : $self->{version}; if (!defined $oldphysinfo) { - $self->{errmsg} .= "Attempt to branch unknown item '$oldphysname':\n" - . "$self->{physname_seen}\n"; + if (!defined $version) { + $self->{errmsg} .= "Attempt to branch unknown item '$oldphysname':\n" + . "$self->{physname_seen}\n"; - return 0; + return 0; + } else { + # Add the file - probably it's previous history was corrupted + + $self->{action} = 'ADD'; + return $self->_add_handler(); + } } - - my $version = defined $row->{version} ? $row->{version} - : $self->{version}; # if we branch into a destroyed object, delete is the logical choice if (!defined $version ) { -- 1.5.3.3
From 700fa8ed5c7f1e1d760b3167f80c846a9b2787ef Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Wed, 21 May 2008 10:35:17 +0400 Subject: [PATCH 09/11] Fixed a bug in handling restore with labels. --- script/Vss2Svn/ActionHandler.pm | 9 +++++++-- script/vss2svn.pl | 2 +- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/script/Vss2Svn/ActionHandler.pm b/script/Vss2Svn/ActionHandler.pm index af96c5e..ff1c332 100644 --- a/script/Vss2Svn/ActionHandler.pm +++ b/script/Vss2Svn/ActionHandler.pm @@ -474,13 +474,18 @@ sub _restore_handler { type => $row->{itemtype}, name => $row->{itemname}, parents => {}, - first_version => 1, - last_version => 1, + first_version => $gPhysInfo{ $row->{physname} }{first_version} || 1, + last_version => $gPhysInfo{ $row->{physname} }{last_version} || 1, orphaned => 1, was_binary => $row->{is_binary}, }; my $newName = $row->{info}; + + if ($newName && $newName eq '/') { + print "Bad restore info name: '$newName'\n"; + undef $newName; + } undef $row->{info}; diff --git a/script/vss2svn.pl b/script/vss2svn.pl index 04a22bf..a2123ee 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -894,7 +894,7 @@ EOSQL foreach $row (@$rows) { #calculate last name of this file. Store it in $info - my $sql = "SELECT * FROM PhysicalAction WHERE physname = ? AND timestamp < ? ORDER BY timestamp DESC"; + my $sql = "SELECT * FROM PhysicalAction WHERE physname = ? AND timestamp < ? AND actiontype <> 'LABEL' ORDER BY timestamp DESC"; $sth = $gCfg{dbh}->prepare($sql); $sth->execute( $row->{physname}, $row->{timestamp} ); -- 1.5.3.3
From 95eb93f725a3aecf3e5422fb76da72d621b9b3c7 Mon Sep 17 00:00:00 2001 From: Alexander Gavrilov <[EMAIL PROTECTED]> Date: Fri, 23 May 2008 00:29:46 +0400 Subject: [PATCH 11/11] Support rolling back during branching. --- script/Vss2Svn/ActionHandler.pm | 13 +++++++++++++ script/Vss2Svn/Dumpfile.pm | 11 +++++++++-- script/vss2svn.pl | 2 +- 3 files changed, 23 insertions(+), 3 deletions(-) diff --git a/script/Vss2Svn/ActionHandler.pm b/script/Vss2Svn/ActionHandler.pm index 6730e83..9cdabe2 100644 --- a/script/Vss2Svn/ActionHandler.pm +++ b/script/Vss2Svn/ActionHandler.pm @@ -313,6 +313,8 @@ sub _branch_handler { # the old location, then create a new one with the pertinent info. The row's # 'physname' is that of the new file; 'info' is the formerly shared file. + # Upd: actually, a file can be rolled back without pinning it first. + my $physname = $row->{physname}; my $oldphysname = $row->{info}; @@ -352,6 +354,17 @@ sub _branch_handler { # parent was destroyed. if (defined $row->{parentphys}) { $oldphysinfo->{parents}->{$row->{parentphys}}->{deleted} = 1; + + my $parentinfo = \%{$oldphysinfo->{parents}->{$row->{parentphys}}}; + my $local_version = + defined $parentinfo->{pinned} ? $parentinfo->{pinned} : $oldphysinfo->{last_version}; + + if ($local_version != $version - 1) { + # Invoking rollback, which involves a commit + print "Rolling back $oldphysname as $physname from $local_version to $version\n"; + $self->{action} = 'ROLLBACK'; + $self->{version} = $version; + } } else { # since we have the "orphaned" handling, we can map this action to an diff --git a/script/Vss2Svn/Dumpfile.pm b/script/Vss2Svn/Dumpfile.pm index 1536d7c..4d3eaef 100644 --- a/script/Vss2Svn/Dumpfile.pm +++ b/script/Vss2Svn/Dumpfile.pm @@ -21,6 +21,7 @@ our %gHandlers = RENAME => \&_rename_handler, SHARE => \&_share_handler, BRANCH => \&_branch_handler, + ROLLBACK => \&_branch_handler, MOVE => \&_move_handler, DELETE => \&_delete_handler, RECOVER => \&_recover_handler, @@ -396,7 +397,8 @@ sub _branch_handler { my($self, $itempath, $nodes, $data, $expdir) = @_; # branching is a no-op in SVN - + # - unless it is a ROLLBACK + # since it is possible, that we refer to version prior to the branch later, we # need to copy all internal information about the ancestor to the child. if (defined $data->{info}) { @@ -408,7 +410,12 @@ sub _branch_handler { $gVersion{$data->{info}}->[$copy_version]; } } - } + } + + # handle rollback, which changes active revision simultaneously with branching + if ($data->{action} eq 'ROLLBACK') { + return $self->_commit_handler ($itempath, $nodes, $data, $expdir); + } # # if the file is copied later, we need to track, the revision of this branch # # see the shareBranchShareModify Test diff --git a/script/vss2svn.pl b/script/vss2svn.pl index a2123ee..4babf64 100755 --- a/script/vss2svn.pl +++ b/script/vss2svn.pl @@ -1401,7 +1401,7 @@ ROW: if ($gCfg{no_orphaned} && @$itempaths) { $itempaths = [ grep { !($_ && /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) } @$itempaths ]; - if ($row->{actiontype} =~ /^(ADD|COMMIT|RENAME|BRANCH|DELETE|RECOVER|LABEL)$/) { + if ($row->{actiontype} =~ /^(ADD|COMMIT|RENAME|BRANCH|ROLLBACK|DELETE|RECOVER|LABEL)$/) { ; } elsif ($row->{actiontype} eq 'SHARE') { if ($row->{info} && $row->{info} =~ /^\/orphaned\/_([A-Z]{8})/ && !$forced_orphans{$1}) { -- 1.5.3.3