10-05-2009 09:29 AM
Hi @all,
My home server has two ST31500341AS drives together with two Samsung 1TB drives. 1 TB from each of the four drives runs as a RAID 5; the remaining 500 GB on each of the Seagates runs as a RAID 1.
The serials are: 9VS280WX and 9VS1Y6A2 - according to the firmware update tool (which I tried) there's nothing to do!
Problem: Several times (>20) per day I see this message:
--- cut here ---
Oct 5 17:37:45 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Oct 5 17:37:45 server kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Oct 5 17:37:45 server kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 5 17:37:45 server kernel: ata3.00: status: { DRDY }
Oct 5 17:37:50 server kernel: ata3: port is slow to respond, please be patient (Status 0xd0)
Oct 5 17:37:55 server kernel: ata3: device not ready (errno=-16), forcing hardreset
Oct 5 17:37:55 server kernel: ata3: hard resetting link
Oct 5 17:37:55 server kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 5 17:37:55 server kernel: ata3.00: configured for UDMA/100
Oct 5 17:37:55 server kernel: ata3: EH complete
Oct 5 17:37:55 server kernel: sd 2:0:0:0: [sda] 2930277168 512-byte hardware sectors (1500302 MB)
Oct 5 17:37:55 server kernel: sd 2:0:0:0: [sda] Write Protect is off
Oct 5 17:37:55 server kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct 5 17:37:55 server kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
--- cut here ---
I already changed the SATA cables and I changed the port on the controller - the error remains. And it's always one of the Seagate drives; never the Samsung.
So what might be the problem here?
Any ideas?
10-05-2009 11:23 PM - edited 10-05-2009 11:24 PM
Additional information: Both drives are running Firmware SD17.
Here's the smartctl output:
--- cut here ---
smartctl 5.39 2009-06-03 15:05 [i686-pc-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST31500341AS
Serial Number: 9VS1Y6A2
Firmware Version: SD17
User Capacity: 1,500,301,910,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Oct 6 08:17:33 2009 CEST
========================
smartctl 5.39 2009-06-03 15:05 [i686-pc-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST31500341AS
Serial Number: 9VS280WX
Firmware Version: SD17
User Capacity: 1,500,301,910,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Tue Oct 6 08:19:56 2009 CEST
--- cut here ---
The drives are NOT running in a hardware RAID but are attached to a cheap RAID controller.
From lspci: RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
And now? Better to buy some other drives and not spend time on these ones?
10-07-2009 09:52 AM
As far as I can tell from your log, the disk starts working again without a reboot. Is that true?
Can you do a smartctl -a on the drive to look at the various counters? It would be interesting to know if the reallocated sector count (or any other counter) goes up when these events happen. That means you need to run smartctl -a at a variety of times, before and after an event, and compare the outputs.
There is a long thread about RAID problems with 7200.11 drives. If you have the patience, do read that thread for more ideas. And contribute to it too. http://forums.seagate.com/stx/board/message?board.
10-09-2009 12:49 AM
Hi,
yes, you are right. The system continues working after 15-30 seconds where NOTHING works! No reboot is necessary.
Here is a diff of the smartctl output before/after an error from this morning:
--- cut here ---
server:~ # diff smartctl_sda_20091009075000.txt smartctl_sda_20091009093500.txt|more
13c13
< Local Time is: Fri Oct 9 07:50:19 2009 CEST
---
> Local Time is: Fri Oct 9 09:36:30 2009 CEST
63c63
< 1 Raw_Read_Error_Rate 0x000f 119 100 006 Pre-fail Always - 216018125
---
> 1 Raw_Read_Error_Rate 0x000f 119 100 006 Pre-fail Always - 216168139
67,68c67,68
< 7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5612519
< 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1478
---
> 7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5615853
> 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1480
73c73
< 188 Unknown_Attribute 0x0032 097 083 000 Old_age Always - 601304596624
---
> 188 Unknown_Attribute 0x0032 100 083 000 Old_age Always - 605599629457
75c75
< 190 Airflow_Temperature_Cel 0x0022 070 058 045 Old_age Always - 30 (Lifetime Min/Max 30/34)
---
> 190 Airflow_Temperature_Cel 0x0022 070 058 045 Old_age Always - 30 (Lifetime Min/Max 29/34)
77c77
< 195 Hardware_ECC_Recovered 0x001a 034 025 000 Old_age Always - 216018125
---
> 195 Hardware_ECC_Recovered 0x001a 034 025 000 Old_age Always - 216168139
--- cut here ---
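The raw-value changes in the diff above can also be computed mechanically. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the usual smartctl attribute row layout where the raw value is the tenth whitespace-separated column. The sample rows are the ones from the diff above.

```python
# Attribute rows copied from the diff above (old snapshot, then new snapshot).
BEFORE = """\
  1 Raw_Read_Error_Rate 0x000f 119 100 006 Pre-fail Always - 216018125
  7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5612519
  9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1478
188 Unknown_Attribute 0x0032 097 083 000 Old_age Always - 601304596624
195 Hardware_ECC_Recovered 0x001a 034 025 000 Old_age Always - 216018125
"""
AFTER = """\
  1 Raw_Read_Error_Rate 0x000f 119 100 006 Pre-fail Always - 216168139
  7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5615853
  9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1480
188 Unknown_Attribute 0x0032 100 083 000 Old_age Always - 605599629457
195 Hardware_ECC_Recovered 0x001a 034 025 000 Old_age Always - 216168139
"""


def parse_smart_attributes(text):
    """Return {attribute id: (name, raw value string)} for attribute rows.

    Hypothetical helper: assumes ID, name, flag, value, worst, threshold,
    type, updated, when-failed, raw value as the first ten columns.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not look like an attribute row.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def raw_deltas(before_text, after_text):
    """Return {attribute id: (name, raw increase)} for raw values that changed."""
    before = parse_smart_attributes(before_text)
    after = parse_smart_attributes(after_text)
    deltas = {}
    for attr_id, (name, raw_after) in after.items():
        if attr_id not in before:
            continue
        raw_before = before[attr_id][1]
        if raw_before.isdigit() and raw_after.isdigit():
            change = int(raw_after) - int(raw_before)
            if change:
                deltas[attr_id] = (name, change)
    return deltas


deltas = raw_deltas(BEFORE, AFTER)
for attr_id in sorted(deltas):
    name, change = deltas[attr_id]
    print(f"{attr_id:3d} {name:<24} +{change}")
```

With these rows, attributes 1 and 195 grow by the same amount (150014), and attribute 188 grows by 4295032833, which is 2^32 + 2^16 + 1.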
Last but not least: yes, I already found your thread and had some interesting minutes while reading it. In another forum somebody gave the hint to replace the cables and/or the controller. I changed them two days ago (it's a sil3124 chip now) - the error is still there.
And since it is not my favourite task to "rebuild" my server every few days, I plan to throw away these Seagate disks and buy some disks from another company - maybe the Samsung 1.5 TB or the WD 1.5TB. That will cost another 200 Euro, but at least the system will again do what it is designed for - being the media center in our house.
frustrated
Andreas
10-09-2009 07:13 AM
10-09-2009 08:37 AM - edited 10-09-2009 11:52 AM
wudel: thanks for the very useful additional information.
fzabkar gave a masterful analysis. He noted that SMART register 188 suggests that you've experienced 0x91 (decimal 145) timeouts. That's a lot. I'd check my 7200.11 drives but they are not on (hey, I don't trust them yet).
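To see where the 0x91 comes from, the raw value of attribute 188 can be split into 16-bit words. This is a sketch only: treating the raw value as three packed 16-bit counters is an assumption here, and `split_words` is a made-up helper name. The two raw values are the ones from wudel's before/after diff.

```python
def split_words(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high to low.

    Assumption: the raw value packs three independent 16-bit counters.
    """
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


before = split_words(601304596624)  # attribute 188 before the error
after = split_words(605599629457)   # attribute 188 after the error

print([hex(w) for w in before])  # ['0x8c', '0x8c', '0x90']
print([hex(w) for w in after])   # ['0x8d', '0x8d', '0x91']
```

Each word went up by exactly one between the two snapshots, and the low word of the later value is 0x91, i.e. 145 decimal.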
I quite understand that pursuing this might be a waste of your time. But I'll ask you more about this anyway.
Do you have any ideas why the system freezes while the ATA command times out? I would imagine that all disk activity is blocked but I would think that anything that didn't need the disk (or could use the contents of the Linux kernel's buffer cache) should be able to continue. Just what activities qualify isn't obvious, but I'd guess echoing keyboard characters (at least in some windows) and moving the mouse cursor should not involve the disk. Are you sure that everything freezes?
The timeout of a command is an interesting failure. My understanding is that this means that the computer has issued an ATA command to the drive and not received a response in a "reasonable" time. I don't know what this time is, but 30 seconds would not be unreasonable.
It suggests to me that the problem has nothing to do with the physical medium of the drive. After all, failures involving the medium have well-defined error responses at the command level. Also, with "native command queuing", I would think that multiple commands could be in flight so one command not responding might not gum up the driver<->disk communications.
Do you have any dmesg output that covers one of these pauses? That should tell us what the driver is observing. It may be that it is just what you showed in the log in your first message.
The long thread I pointed at talks about how to cut down the length of time that the drive will attempt to recover from an error. There is a chance that ratcheting that time down might affect the length of your pauses. I would be very interested in knowing if that is the case.
Why am I asking you so much? Because you are one of the few people reporting a problem like this who has been able to drill down to give us more information.
10-09-2009 11:03 AM
More questions.
Is it always the same drive freezing? Which drive is it?
Does smartctl show anything interesting about the other drive? In particular, what is the value of attribute 188?
Does the log (quoted in the first message) say anything interesting in the minute before the section you quoted? I ask because your log covers only the 10-second period when the driver is trying to recover. That does not include the moment when the drive got into trouble.
Your controller seems to be (from lspci): RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
I have a vague recollection of problems with this chip and Linux. Google helped me find this, but the symptoms might differ from yours. Note: this particular Dave Jones is a serious kernel hacker. http://www.codemonkey.org.uk/2009/01/20/sata-disas
There are other Google hits but none jumped at me. More careful Googling might well be rewarded.
I'm too lazy to look into this, but the controller datasheet might have useful information, perhaps around "watchdog" http://www.siliconimage.com/docs/SiI-DS-0103-D.pdf
Have you tried a different controller?
10-09-2009 11:43 AM
10-09-2009 12:25 PM - edited 10-09-2009 03:21 PM
fzabkar: it could well be what you say. It would be worth trying to figure that out.
I have not even decoded the information that we already have. So I'll make an attempt.
Here's an interesting description of the kernel libata messages http://ata.wiki.kernel.org/index.php/Libata_error_
Oct 5 17:37:45 server kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
The 0x values probably say little (except that nothing is interestingly wrong). The action 0x2 is ATA_EH_SOFTRESET
Oct 5 17:37:45 server kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
FLUSH CACHE EXT command [thanks fzabkar]
Oct 5 17:37:45 server kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
The result taskfile says: status 40, error 00, number of sectors 1, LBA 0xC24F01, HOB error 00, HOB number of sectors 0, HOB LBA 0, Device/head 00 (the A0 is in the cmd line, not in the res line), error mask 0x4 AC_ERR_TIMEOUT
Oct 5 17:37:45 server kernel: ata3.00: status: { DRDY }
That status means "Device ready. Normally 1, when all is OK"
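The field-by-field reading of the res line above can be reproduced with a few lines of Python. This is a rough sketch, not a libata tool: `parse_res` is a made-up name, and the field order (status/error:nsect:lbal:lbam:lbah/five HOB bytes/device) is my understanding of the libata message layout. The status bit names are the standard ATA ones.

```python
import re

# Standard ATA status register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def parse_res(line):
    """Decode the result taskfile from a libata 'res' log line (assumed layout)."""
    taskfile = re.search(r"\bres\s+(\S+)", line).group(1)
    status, low, hob, device = taskfile.split("/")
    error, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_bytes = [int(b, 16) for b in hob.split(":")]
    status = int(status, 16)
    return {
        "status": status,
        "status_flags": [n for bit, n in STATUS_BITS.items() if status & bit],
        "error": error,
        "nsect": nsect,
        # The low 24 bits of the LBA come from the lbah, lbam, lbal bytes.
        "lba": (lbah << 16) | (lbam << 8) | lbal,
        "hob": hob_bytes,
        "device": int(device, 16),
    }


line = ("Oct 5 17:37:45 server kernel: res "
        "40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)")
res = parse_res(line)
print(res["status_flags"], hex(res["lba"]))
```

For the line quoted above this gives the DRDY flag only and an LBA of 0xc24f01.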
10-09-2009 12:30 PM
Can you correlate the lockups with smartd? Have a look at http://www.mail-archive.com/linux-ide@vger.kernel.
That's pretty old so it probably doesn't apply. But it does involve the same controller and similar lockups, I think.