Pablo Gosse wrote:
Howdy folks.  I'm running into something strange with array_diff that
I'm hoping someone can shed some light on.

I have two tab-delimited text files, and need to find the lines in the
first that are not in the second, and vice-versa.

There are 794 records in the first, and 724 in the second.

Simple enough, I thought.  The following code should work:

$tmpOriginalGradList = file('/path/to/graduate_list_original.txt');
$tmpNewGradList = file('/path/to/graduate_list_new.txt');

$diff1 = array_diff($tmpOriginalGradList, $tmpNewGradList);
$diff2 = array_diff($tmpNewGradList, $tmpOriginalGradList);

I expected that this would set $diff1 to have all elements of
$tmpOriginalGradList that did not exist in $tmpNewGradList, but it
actually contains many elements that exist in both.

The same is true for $diff2, in that many of its elements exist in both
$tmpOriginalGradList and $tmpNewGradList as well.

Since this returns $diff1 as having 253 elements and $diff2 as having
183, it sort of makes sense, since the difference between those two
numbers is 70, which is the difference between the number of lines in
the two files.  But the bottom line is that both $diff1 and $diff2
contain elements common to both files, which using array_diff simply
should not be the case.


Hard to say what happened here. If I had to take a guess I might say that you're getting line wrapping in the middle of 183 different records.
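
One rough way to test that guess is to count the tab-separated fields on each line. If records are being wrapped mid-line, some lines will have fewer fields than the rest. This is only a sketch: the function name field_count_histogram and the sample rows are made up for illustration, and only explode(), rtrim(), count() and ksort() are stock PHP.

```php
<?php

// Hypothetical helper: tally how many lines have each tab-separated
// field count. A clean export should show a single field count.
function field_count_histogram(array $lines): array
{
    $histogram = [];
    foreach ($lines as $line) {
        // Drop the trailing line ending that file() leaves on each line.
        $count = count(explode("\t", rtrim($line, "\r\n")));
        $histogram[$count] = ($histogram[$count] ?? 0) + 1;
    }
    ksort($histogram); // order by field count for easier reading
    return $histogram;
}

// Made-up sample: two complete 3-field rows and one row broken in two.
$sample = ["a\tb\tc\n", "d\te\n", "f\n", "g\th\ti\n"];
print_r(field_count_histogram($sample));
```

With the real data you would pass in the result of file('/path/to/graduate_list_original.txt') instead of $sample. More than one field count in the output would point at broken or ragged records.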


However, when I loop through each file and strip out all the tabs:


And really since you have tab-delimited records you should be exploding on those tabs in order to get the data set. But because I'm slightly paranoid I would do it on the entire string of the file.


<?php

$str_OriginalGradList = file_get_contents('/path/to/graduate_list_original.txt');
$ary_OriginalGradList = explode(chr(9), $str_OriginalGradList);
$str_NewGradList = file_get_contents('/path/to/graduate_list_new.txt');
$ary_NewGradList = explode(chr(9), $str_NewGradList);


$diff1 = array_diff($ary_OriginalGradList, $ary_NewGradList);
$diff2 = array_diff($ary_NewGradList, $ary_OriginalGradList);
echo '<pre>';
var_dump($diff1);
var_dump($diff2);
echo '</pre>';

?>

foreach ($tmpOriginalGradList as $k=>$l) {
        $tmp = str_replace(chr(9), '', $l);
        $tmpOriginalGradList[$k] = $tmp;
}

foreach ($tmpNewGradList as $k=>$l) {
        $tmp = str_replace(chr(9), '', $l);
        $tmpNewGradList[$k] = $tmp;
}

I get $diff1 as having 75 elements and $diff2 as having 5, which also
sort of makes sense, since numerically there is a 70-line difference
between the two files.

I also manually replaced the tabs and checked about 20 of the elements
in $diff1 and none were found in the new text file, and none of the 5
elements in $diff2 were found in the original text file.


75 / 5 is probably the right mix. Programmatically you can check this by comparing the diffs with each list.
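
A sketch of such a check follows. Comparing a diff against the other list with exact string matching can never find anything, because array_diff already used that same comparison, so this version normalizes whitespace first. The helper names (normalize_line, whitespace_only_diffs) and the sample records are hypothetical, and the idea that the files differ only in whitespace is an assumption, not something confirmed from your data.

```php
<?php

// Hypothetical helper: collapse every run of whitespace (tabs, spaces,
// line endings) into one space and trim both ends.
function normalize_line(string $line): string
{
    return trim(preg_replace('/\s+/', ' ', $line));
}

// Hypothetical helper: return the entries of $diff that reappear in
// $other once both sides are normalized, i.e. entries that differ from
// a line in $other only by whitespace.
function whitespace_only_diffs(array $diff, array $other): array
{
    // Flip so normalized lines become keys for fast isset() lookups.
    $otherNormalized = array_flip(array_map('normalize_line', $other));
    $suspects = [];
    foreach ($diff as $line) {
        if (isset($otherNormalized[normalize_line($line)])) {
            $suspects[] = $line;
        }
    }
    return $suspects;
}

// Made-up sample records; the first differs only in its line ending.
$original = ["Smith\tJohn\t2004\r\n", "Jones\tAnn\t2003\n"];
$new      = ["Smith\tJohn\t2004\n", "Brown\tLee\t2004\n"];

$diff1 = array_diff($original, $new);
$diff2 = array_diff($new, $original);

var_dump(whitespace_only_diffs($diff1, $new));
var_dump(whitespace_only_diffs($diff2, $original));
```

Anything this returns is a line that array_diff reported as different but that matches a line in the other file apart from whitespace. Be aware that collapsing whitespace also hides empty fields, so two records that differ only in which column is blank would be treated as equal.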


However, if in the code above I replace the tabs with a space instead of
just stripping them out, then the numbers are again 253 and 183.

I'm inclined to think the second set of results is accurate, since I was
unable to find any of the 20 elements I tested in $diff1 in the new text
file, and none of the elements in $diff2 are in the original text file.

Does anyone have any idea why this is happening?  The tab-delimited
files were generated from Excel spreadsheets using the same script, so
there wouldn't be any difference in the formatting of the files.


The sad truth is that this is quite possibly the root cause of your problem. I have had many, many problems caused by MS Excel converting to and from other data formats. I don't completely understand the escaping process in Excel, but double quotes have always been a problem. And occasionally it seems like Excel just barfs on a tab or a comma. Why it does that is completely beyond me. I can't count the number of times that I have opened a comma-delimited file in Excel, just *looked* at it, saved it, and then found the source mangled a bit.


Moral of the story: I never use Excel to view tab- or comma-delimited data unless I have a backup someplace.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


