An OSD Going Down Because of Bad Disk Sectors

When the OSD went down, its log showed:

f049a9f5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f049a9f5700 time 
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

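The condition assert(allow_eio || !m_filestore_fail_eio || got != -5) explains the crash: when allow_eio is false and fail_eio is set to true in the configuration, a read that returns an I/O error (got == -5, that is, -EIO) makes the assertion fail and the OSD process aborts.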
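The condition is easier to follow when written out as a predicate. The following is a minimal Python sketch of the logic only, not the Ceph source; the function name `assert_would_fail` and its parameter names are made up for illustration, and the value 5 for EIO is the Linux value.

```python
EIO = 5  # errno value for "Input/output error" on Linux


def assert_would_fail(got: int, allow_eio: bool, fail_eio: bool) -> bool:
    """Return True when the quoted assert would fail (hypothetical helper).

    The assert holds if any one of its three terms is true, so it fails
    only when all three are false at the same time.
    """
    assertion_holds = allow_eio or (not fail_eio) or got != -EIO
    return not assertion_holds


# The case from the log: EIO returned, caller does not allow EIO,
# and fail_eio is enabled.
print(assert_would_fail(-EIO, allow_eio=False, fail_eio=True))   # True
# A successful read of 4096 bytes never trips the assert.
print(assert_would_fail(4096, allow_eio=False, fail_eio=True))   # False
```

Under this reading, disabling fail_eio would stop the abort but would not make the unreadable data readable.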

Next, check the kernel messages for the disk backing the OSD:

# dmesg -T | grep sdh
sd 0:2:7:0: [sdh] 3904294912 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:7:0: [sdh] Write Protect is off
sd 0:2:7:0: [sdh] Mode Sense: 1f 00 00 08
sd 0:2:7:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdh: unknown partition table
sd 0:2:7:0: [sdh] Attached SCSI disk
 XFS (sdh): Mounting V4 Filesystem
 XFS (sdh): Ending clean mount
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294168
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294400
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294400
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294168
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294400
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294400
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294352
 Buffer I/O error on device sdh, logical block 1100294352
 Buffer I/O error on device sdh, logical block 1100294353
 Buffer I/O error on device sdh, logical block 1100294354
 Buffer I/O error on device sdh, logical block 1100294355
 Buffer I/O error on device sdh, logical block 1100294356
 Buffer I/O error on device sdh, logical block 1100294357
 Buffer I/O error on device sdh, logical block 1100294358
 Buffer I/O error on device sdh, logical block 1100294359
 Buffer I/O error on device sdh, logical block 1100294360
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294405
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294405
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh]
 sd 0:2:7:0: [sdh] CDB:
 end_request: I/O error, dev sdh, sector 1100294405
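The failing sectors in kernel output like the above repeat many times, so it can help to collect the distinct values before choosing a range to scan. Below is a small Python sketch under the assumption that the messages use the "end_request: I/O error, dev X, sector N" wording shown here; the helper name `failing_sectors` is hypothetical, and the sample text is a subset of the lines above.

```python
import re
from typing import List

# Matches lines such as: end_request: I/O error, dev sdh, sector 1100294168
SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(dmesg_text: str, device: str) -> List[int]:
    """Return the sorted, de-duplicated failing sectors for one device."""
    sectors = set()
    for match in SECTOR_RE.finditer(dmesg_text):
        if match.group(1) == device:
            sectors.add(int(match.group(2)))
    return sorted(sectors)


sample = """\
 end_request: I/O error, dev sdh, sector 1100294168
 end_request: I/O error, dev sdh, sector 1100294400
 end_request: I/O error, dev sdh, sector 1100294400
 end_request: I/O error, dev sdh, sector 1100294352
 end_request: I/O error, dev sdh, sector 1100294405
"""

sectors = failing_sectors(sample, "sdh")
print(sectors)                      # four distinct sectors
print(min(sectors), max(sectors))   # bounds for choosing a scan range
```

In practice the text would come from saved `dmesg -T` output rather than an inline string.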

The I/O errors are clearly visible. A range of blocks around the failing sectors was then scanned:

# badblocks -v -s -b 512 -o /root/badblocks.txt /dev/sdh 1100300000 1100290000
Checking blocks 1100290000 to 1100300000
Checking for bad blocks (read-only test): done
Pass completed, 8 bad blocks found. (8/0/0 errors)
[root@host20 ~]# cat badblocks.txt
1100294400
1100294401
1100294402
1100294403
1100294404
1100294405
1100294406
1100294407

Blocks 1100294400 to 1100294407 were detected, but 1100294352 to 1100294360 were not (badblocks was run only in its read-only test mode).
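The gap between the two sources can be checked mechanically. The sketch below compares the "Buffer I/O error ... logical block" numbers reported by the kernel with the contents of badblocks.txt. It assumes both are counted in the same 512-byte units (badblocks was run with -b 512); the helper names `parse_block_list` and `missed_by_scan` are hypothetical.

```python
from typing import Iterable, List


def parse_block_list(text: str) -> List[int]:
    """Parse a badblocks output file: one block number per line."""
    return [int(line) for line in text.split() if line.strip()]


def missed_by_scan(kernel_blocks: Iterable[int],
                   scan_blocks: Iterable[int]) -> List[int]:
    """Blocks the kernel reported as failing that the scan did not list."""
    return sorted(set(kernel_blocks) - set(scan_blocks))


# Logical blocks from the "Buffer I/O error" lines in the dmesg output.
kernel_blocks = range(1100294352, 1100294361)

# Contents of /root/badblocks.txt as shown above.
badblocks_txt = """\
1100294400
1100294401
1100294402
1100294403
1100294404
1100294405
1100294406
1100294407
"""

missed = missed_by_scan(kernel_blocks, parse_block_list(badblocks_txt))
print(len(missed), missed[0], missed[-1])  # 9 1100294352 1100294360
```

All nine blocks reported by the kernel in that range are absent from the read-only scan result, which matches the observation above.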
