
Here was my situation with my unstable main home computer: the problem showed up randomly between Thursday evening and Monday morning, and it could sometimes be reproduced with specific processes.  My last major hardware upgrade consisted of a BIOS upgrade, new HDDs, more RAM, another DVD drive, a new CPU and a new graphics card.  At that same moment, I upgraded from XP SP3 to Windows 7.  I first tried the x64 version, which crashed for no apparent reason, so I used x86 W7 for months before trying again and getting a successful install.  Note that BSODs did happen with the x86 version, but they were rare.

So much had changed at once that, since I had just upgraded the system, it was very hard to tell whether the problem was related to hardware, Windows or software.

I had a heavily used Win7 x64 computer that kept throwing 0x0000003B BSODs, and I looked for a solution for a long time, trying pretty much everything I could think of:

  • Doing a repair upgrade install, which fixed some causes of that BSOD (I could reproduce it by trying to install Office 2010!)
  • Flushing the SoftwareDistribution folder while the Windows Update service is stopped (not sure that works in W7 to reinstall all updates…)
  • Memtest86 found a faulty stick 3 months ago, which I removed (it used to cause 0x0000000A or 0x0000001A errors)
  • chkdsk /f fixed some errors, but that didn’t solve the problem
  • I used smartmontools to check HDD life: there’s about 15% left, no errors
  • I unplugged some weirdly behaving speakers and the APC UPS data cable to rule them out as possible causes: no change
  • I updated the graphics drivers
  • sfc /scannow reported errors it couldn’t fix at the beginning, but it has been fine since the repair reinstall
  • Swapped the power supply and memory with known working ones, to test
  • Tried intensive work from a Linux LiveCD without issues
  • Unplugged all hard drives and DVD drives except the system HDD: it still happened
  • Tried a parallel install of W7 3 times: it BSODed all 3 times, same error, same code
  • Tried different manual BIOS settings for memory voltage, virtualization, memory speed, dual channel or not
  • …I’m probably forgetting something

See the end of the post for more info about the whole setup and the error messages.

Here are some details from BlueScreenView:

Here’s what I finally dug into to fix this issue:

  • I took the time to understand how to interpret and decode BSODs to get more information from this link
  • Since I noticed a lot of VISTA_DRIVER_FAULT entries in the logs, I thought I should double-check the drivers, and indeed I was using one Vista driver and one W7 x86 driver even though an x64 one was available.  NOTE: Don’t rely on your motherboard maker for drivers; go to the chipset/integrated device manufacturer’s website for specific drivers, which are supposed to be far more up to date (in my case, I went to nvidia.com to update my chipset’s).  This didn’t fix anything, though.
  • Analyzed the eventvwr logs, where there wasn’t much helpful data
  • Checked for malware in files, registry and startup: none to be seen
  • Double-checked the BSOD information, found the faulty processes and blocked them from running: those were the Windows Media Player Network Sharing Service and the Search Indexer (both Windows services).  No more recurrent 0x3B bluescreens since then!
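The last step above, spotting which processes keep showing up in the crashes, can be sketched with a few lines of shell. This is only an illustration, assuming each crash analysis (the kind of text shown in the error logs at the end of this post) has been saved to a plain-text file; the function name count_crash_processes is made up for the example.

```shell
# Count how often each process is named in saved crash analyses.
# Usage: count_crash_processes analysis1.txt [analysis2.txt ...]
# (hypothetical helper; the input files are an assumption of this sketch)
count_crash_processes() {
    # Lines of interest look like "PROCESS_NAME:  wmpnetwk.exe".
    awk '$1 == "PROCESS_NAME:" {
             sub(/\r$/, "", $2)      # tolerate Windows CRLF line endings
             seen[$2]++
         }
         END { for (p in seen) print seen[p], p }' "$@" |
        sort -k1,1nr -k2,2           # most frequent process first
}
```

A process that tops this list across many dumps is a good first candidate to disable and see whether the bluescreens stop.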

Conclusions

  • It isn’t hardware related, since Linux has no issues
  • Reinstalling fails because the installer detects another Windows install and tries to access its hive, which may be corrupted. [theory]
  • Such a corrupted hive could explain the non-random crashes, such as those from Internet Explorer, .NET Framework or the Office 2010 install.
  • It could be the combination of an updated BIOS and a new CPU that causes instability in drivers or in the Windows kernel (note that SP1 isn’t out yet!) [theory]
  • What could have been damaging the system is the faulty memory that has since been removed.  I should try a reinstall without the current system drive plugged in, but I am lazy since everything is fine now.

Here are some appendices:

My hardware setup :

750 W PSU
ASUS M2N-SLi DELUXE with the latest stable BIOS version (not the beta that supports socket AM3 CPUs)
AMD Phenom X4 9850BE
5 hard drives for a total of 3.14 TB, system is on a 250 GB one
Radeon 4890 graphics card
SoundBlaster Audigy 4ZS
2 DVD drives, one from Pioneer and one from LG
Phones are often hooked up to the system: a Blackberry Bold 9000 and a Motorola Milestone via its dock
Saitek Cyborg keyboard
Microsoft mouse
Monitor is a Samsung 225BW
A PCI network card DFE538TX I think, because the onboard NICs are known to do freegames
The computer is protected from surges and such by a 1300VA UPS from APC

———————————

Here are some error logs I got from BSODs :

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 :

//
// MessageId: STATUS_ACCESS_VIOLATION
//
// MessageText:
//
// The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
//
#define STATUS_ACCESS_VIOLATION          ((NTSTATUS)0xC0000005L)
PROCESS_NAME:  mscorsvw.exe

STACK_TEXT:
nt!HvpFindFreeCellInThisViewWindow+0x45

CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x3B
PROCESS_NAME: wmpnetwk.exe
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from 0000000000000000 to fffff8000314c855
STACK_TEXT:
fffff880`06da66e0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!HvpFindFreeCellInThisViewWindow+0x45


I couldn’t mount a Linux system partition (ext3) because of a read error… why? I don’t know, but here is what I learned from some diagnostic advice.

[21:48:18] <DrMax> vn0 : try smartmontools and smartctl first
[21:48:34] <DrMax> sudo smartctl -t long /dev/yourdisk
[21:48:47] <DrMax> that will start the S.M.A.R.T. low-level diagnostic
[21:48:57] <DrMax> there’s no point running fsck if your disk is dying
[21:49:02] <DrMax> it will just make things worse
[21:49:07] <vn0> before fsck?
[21:49:09] <vn0> ok
[21:49:11] <DrMax> yes!
[21:49:22] <vn0> hmm I don’t have it on the livecd
[21:49:23] <DrMax> the low-level test is done by the HD itself
[21:49:30] <vn0> just something called smartdimmer
[21:49:34] <DrMax> sudo apt-get install smartmontools
[21:49:47] <DrMax> you do have teh interwebs, right?
[21:49:52] <vn0> y
[21:50:04] <DrMax> the low-level test is done by the HD itself and it’s “off-line”
[21:50:09] <DrMax> you can keep working once it’s started
[21:50:13] <vn0> neat
[21:50:23] <DrMax> when you launch it, it gives an estimate of how long it will take
[21:50:24] <DrMax> and
[21:50:31] <DrMax> sudo smartctl -a /dev/yourdisk
[21:50:36] <DrMax> that will show the status
[21:50:39] <vn0> wtf it’s installing postfix
[21:50:55] <DrMax> yeah, to notify you by mail that your HD is bleeding out its ass
[21:51:04] <DrMax> just pick no config and that’s it
[21:51:10] <DrMax> (that’s new, it didn’t do that before)
[21:51:11] <vn0> done that
[21:51:25] <DrMax> « my anus is bleeding »
[21:51:28] <DrMax> « my anus is bleeeeeeeeding »
[21:51:31] <DrMax> lol
[21:51:48] <DrMax> so when you launch the long test
[21:51:55] <DrMax> < DrMax > sudo smartctl -t long /dev/yourdisk
[21:51:59] <DrMax> it will give you a time estimate
[21:52:10] <vn0> the -a already gives me output
[21:52:21] <DrMax> until the test is finished, it won’t show up in -a
[21:52:41] <vn0> ok so that’s the part starting at “not_testing”
[21:53:09] <vn0> 61 minutes
[21:53:11] <vn0> not too bad
[21:53:13] <DrMax> when your long test completes, you’ll see:
[21:53:21] <DrMax> SMART Self-test log structure revision number 1
[21:53:22] <DrMax> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
[21:53:22] <DrMax> # 1 Extended offline Completed without error 00% 17 –
[21:53:26] <DrMax> (for example, that’s my HD)
[21:53:34] <DrMax> you might get errors
[21:53:42] <DrMax> if your HD reports errors, it’s time to replace it
[21:53:43] <vn0> that’s with -a?
[21:53:46] <DrMax> yes
[21:53:52] <vn0> okey tks
[21:53:56] <DrMax> until the long test has completed, it won’t show up
[21:53:58] <vn0> I’m saving the log, it’s very useful
[21:54:14] <vn0> if there aren’t any, fsck?
[21:54:23] <DrMax> yes, in that case it’s just a fs problem
[21:54:32] <vn0> k
[21:55:05] <DrMax> otherwise… well, buy yourself a new HD
[21:55:07] <DrMax> it’s toast ;)
[21:55:27] <vn0> hehe no biggie. thanks @ backups
[21:55:57] <DrMax> if it’s toast, you’ll see something like #1 extended offline <reason for the error> completed 65% <lifetime> <sector of the error>
[21:56:08] <DrMax> it’s not supposed to be a biggie anyway
[21:56:33] <DrMax> but it’s better to detect the failure of the HD itself than to make things worse by repairing a FS on a hd that can no longer hold its data
[22:00:09] <vn0> indeed
[22:07:07] <DrMax> if you reboot, it cancels the test
[22:07:21] <DrMax> it will show e.g. “cancelled by host” or something like that
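The advice from the chat (run the long self-test, read the result, and only then decide between fsck and a new disk) can be sketched as a small shell helper. This is a rough illustration, not a definitive tool: the function name selftest_verdict is made up, /dev/sdX is a placeholder device, and the status strings it looks for are assumptions based on the usual smartctl self-test log wording, so check them against your own output.

```shell
# Classify the most recent entry of a SMART self-test log read from stdin.
# Prints one of: ok, running, aborted, failed, none.
# (hypothetical helper; the status wording it matches is an assumption)
selftest_verdict() {
    awk '
        /^# *1[ \t]/ {                  # entry "# 1" is the most recent test
            found = 1
            if      (/Completed without error/) print "ok"
            else if (/in progress/)             print "running"
            else if (/Aborted|Interrupted/)     print "aborted"
            else                                print "failed"
            exit
        }
        END { if (!found) print "none" }
    '
}

# Typical use, with /dev/sdX standing in for the real disk:
#   sudo smartctl -t long /dev/sdX          # start the long test
#   sudo smartctl -l selftest /dev/sdX | selftest_verdict
```

Only an "ok" verdict suggests moving on to fsck; a "failed" verdict is the case where the chat says to replace the disk instead.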

sidenote: normally you run safedd to make an image of your drive into another file, then you work on the image, and you have a backup boot sector, backup superblock… safedd ignores read errors (it replaces the errors with nulls), which gives you a workable disk image. It makes a byte-for-byte copy of the drive.

 

Note 21-02-2011: you can also use regular dd with the options conv=noerror,sync
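As a minimal sketch of that dd invocation: the wrapper name image_with_dd, the default block size of 4096 and the paths in the usage line are all illustrative choices, not something from the original discussion.

```shell
# Image a source (device or file) into a file, padding unreadable blocks with NULs.
# Usage: image_with_dd SOURCE IMAGE [BLOCK_SIZE]
# (hypothetical wrapper; the 4096-byte default is an arbitrary choice)
image_with_dd() {
    src=$1
    img=$2
    bs=${3:-4096}
    # noerror: keep going after a read error instead of stopping
    # sync:    pad every short or failed input block with NULs up to bs,
    #          so the data that follows stays at the same offset in the image
    dd if="$src" of="$img" bs="$bs" conv=noerror,sync
}

# Typical use, with /dev/sdX standing in for the failing disk:
#   sudo sh -c '. ./image.sh; image_with_dd /dev/sdX /mnt/backup/disk.img'
```

Because of sync, the image is padded up to a multiple of the block size, and a whole block is zeroed for each read error, so a smaller block size loses less data around a bad sector at the cost of a slower copy.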