Ceph pg scrub error

Locating and resolving a Ceph pg scrub error

Symptoms

The Ceph cluster reports HEALTH_ERR:

root@ceph0:~# ceph -s
    cluster f01fb68c-58c6-4707-8adb-b7ac88172340
     health HEALTH_ERR
            1 pgs inconsistent
            2 pgs repair
            3 scrub errors


Analysis

Use ceph health detail to see the details of the error:

root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 3 scrub errors
pg 1.26d is active+clean+scrubbing+deep+inconsistent+repair, acting [157,16]
3 scrub errors
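
On a larger cluster it can help to pull the affected pgs out of this output programmatically. The following is a minimal sketch, assuming the line format shown above; the function name inconsistent_pgs and the regular expression are illustrative and not part of Ceph. The first OSD in the acting set is the primary.

```python
import re

# Matches lines such as:
#   pg 1.26d is active+clean+...+inconsistent+repair, acting [157,16]
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]+)\]")


def inconsistent_pgs(health_detail):
    """Return (pgid, acting_osds) for each pg whose state includes 'inconsistent'."""
    found = []
    for line in health_detail.splitlines():
        m = PG_LINE.match(line.strip())
        if not m:
            continue
        pgid, state, acting = m.groups()
        # The pg state is a '+'-separated list of flags.
        if "inconsistent" in state.split("+"):
            found.append((pgid, [int(osd) for osd in acting.split(",")]))
    return found


# Sample text copied from the ceph health detail output above.
SAMPLE = """HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 3 scrub errors
pg 1.26d is active+clean+scrubbing+deep+inconsistent+repair, acting [157,16]
3 scrub errors"""

for pgid, acting in inconsistent_pgs(SAMPLE):
    print(pgid, "primary osd:", acting[0], "acting set:", acting)
```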


ceph -w shows the following:

root@ceph0:~# ceph -w
    cluster f01fb68c-58c6-4707-8adb-b7ac88172340
     health HEALTH_ERR
            1 pgs inconsistent
            1 pgs repair
            3 scrub errors
...
2018-02-11 08:18:44.756270 osd.157 [ERR] 1.26d shard 16: soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head(57344'3128467 client.1177975.1:384455284 dirty|omap_digest s 3997696 uv 3128467 od ffffffff alloc_hint [4194304 4194304 0])
2018-02-11 08:18:44.756275 osd.157 [ERR] 1.26d shard 157: soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head(57344'3128467 client.1177975.1:384455284 dirty|omap_digest s 3997696 uv 3128467 od ffffffff alloc_hint [4194304 4194304 0])
2018-02-11 08:18:44.756276 osd.157 [ERR] 1.26d soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head: failed to pick suitable auth object
2018-02-11 08:18:44.756336 osd.157 [ERR] repair 1.26d 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head on disk size (4194304) does not match object info size (3997696) adjusted for ondisk to (3997696)
...

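The output shows that osd.157 reported the errors: on both shards (16 and 157) the object's on-disk size is 4194304 bytes, while the authoritative object info (auth oi) records 3997696 bytes. The detailed error message is: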

soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head(57344'3128467 client.1177975.1:384455284 dirty|omap_digest s 3997696 uv 3128467 od ffffffff alloc_hint [4194304 4194304 0])
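
The fields needed for the fix (object name, on-disk size, and the size recorded in the object info) can be extracted from such a line. This is a rough sketch that assumes the log format shown above; the helper parse_size_mismatch and its regular expression are hypothetical and may need adjusting for other Ceph versions.

```python
import re

# Assumed format, taken from the log lines above:
#   <pgid> shard <osd>: soid <soid> size <on-disk> != size <object-info> from auth oi ...
SIZE_MISMATCH = re.compile(
    r"(?P<pgid>\S+) shard (?P<shard>\d+): soid (?P<soid>\S+) "
    r"size (?P<disk>\d+) != size (?P<oi>\d+) from auth oi"
)


def parse_size_mismatch(line):
    """Return a dict describing a size-mismatch scrub error, or None if no match."""
    m = SIZE_MISMATCH.search(line)
    if m is None:
        return None
    # The soid looks like 1:b64c91ef:::<object name>:head, so the object
    # name sits between ':::' and the final ':'.
    _, _, rest = m.group("soid").partition(":::")
    name = rest.rsplit(":", 1)[0]
    disk, oi = int(m.group("disk")), int(m.group("oi"))
    return {
        "pgid": m.group("pgid"),
        "shard": int(m.group("shard")),
        "object": name,
        "disk_size": disk,
        "oi_size": oi,
        "excess_bytes": disk - oi,
    }


LINE = (
    "2018-02-11 08:18:44.756270 osd.157 [ERR] 1.26d shard 16: soid "
    "1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != "
    "size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head"
)
print(parse_size_mismatch(LINE))
```

For the line above, the on-disk copy is 4194304 - 3997696 = 196608 bytes larger than the object info says it should be.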


The same errors can be found in the osd.157 log:

/var/log/ceph/ceph-osd.157.log
...
2018-02-11 08:18:44.756268 7f4d39e93700 -1 log_channel(cluster) log [ERR] : 1.26d shard 16: soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head(57344'3128467 client.1177975.1:384455284 dirty|omap_digest s 3997696 uv 3128467 od ffffffff alloc_hint [4194304 4194304 0])
2018-02-11 08:18:44.756273 7f4d39e93700 -1 log_channel(cluster) log [ERR] : 1.26d shard 157: soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head size 4194304 != size 3997696 from auth oi 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head(57344'3128467 client.1177975.1:384455284 dirty|omap_digest s 3997696 uv 3128467 od ffffffff alloc_hint [4194304 4194304 0])
2018-02-11 08:18:44.756275 7f4d39e93700 -1 log_channel(cluster) log [ERR] : 1.26d soid 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head: failed to pick suitable auth object
2018-02-11 08:18:44.756335 7f4d39e93700 -1 log_channel(cluster) log [ERR] : repair 1.26d 1:b64c91ef:::rb.0.10a311.6b8b4567.00000000d200:head on disk size (4194304) does not match object info size (3997696) adjusted for ondisk to (3997696)


Check the size of the object on osd.157:

root@ceph0:/var/lib/ceph/osd/ceph-157/current/1.26d_head/DIR_D/DIR_6/DIR_2# find . -name "rb.0.10a311.6b8b4567.00000000d200*"
./DIR_3/rb.0.10a311.6b8b4567.00000000d200__head_F789326D__1
root@ceph0:/var/lib/ceph/osd/ceph-157/current/1.26d_head/DIR_D/DIR_6/DIR_2# ll ./DIR_3/rb.0.10a311.6b8b4567.00000000d200__head_F789326D__1
-rw-r--r-- 1 ceph ceph 4194304 Feb 11 08:20 ./DIR_3/rb.0.10a311.6b8b4567.00000000d200__head_F789326D__1

The same object can be checked on the other replica node for confirmation.
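
The same check can be scripted so that it runs identically on each replica node. This is a small sketch, assuming the directory layout shown in the listing above and that the object name appears verbatim as the filename prefix, as it does there; find_object_files is an illustrative helper, not a Ceph tool.

```python
import os


def find_object_files(pg_dir, object_name):
    """Return (path, size_in_bytes) for files under pg_dir whose name starts with object_name."""
    matches = []
    for root, _dirs, files in os.walk(pg_dir):
        for fname in files:
            if fname.startswith(object_name):
                path = os.path.join(root, fname)
                matches.append((path, os.path.getsize(path)))
    return matches


# Path and object name taken from the listing above; adjust the OSD id per node.
PG_DIR = "/var/lib/ceph/osd/ceph-157/current/1.26d_head"
OBJECT = "rb.0.10a311.6b8b4567.00000000d200"

if os.path.isdir(PG_DIR):
    for path, size in find_object_files(PG_DIR, OBJECT):
        print(size, path)
```

Running it on both nodes and comparing the printed sizes against the object info size from the scrub error shows which copies disagree with the metadata.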


Resolution

The analysis above shows that the error comes from one object in the pg having the wrong size. How can it be fixed?


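Three approaches were tried:

1. Delete the object from the pg on the primary osd to trigger pg recovery.

Result: failed. The pg error persisted, and the log showed the same errors as before.

2. Compare the object in the pg on the primary and replica osds. If the copies are identical, keep trying to repair the pg; if they differ, pick one copy to replace the other, then try recovery.

Result: the pool here keeps two replicas, and the object is identical in the pg on both osds, so there was no differing copy to restore from.

3. Use the rados command to change the object's size to the value recorded in the object info, then trigger pg recovery.

Result: the pg was repaired successfully.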

root@ceph0:~# rados -p kube-ssd truncate rb.0.10a311.6b8b4567.00000000d200 3997696
root@ceph0:~# ceph pg repair 1.26d
instructing pg 1.26d on osd.157 to repair
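
The two commands above can be wrapped in a small script when the same fix has to be applied to several objects. The sketch below only assembles the command lines used in this post and prints them by default; build_fix_commands and run_fix are hypothetical helper names, and the dry-run default is a precaution, since truncating an object changes its data.

```python
import subprocess


def build_fix_commands(pool, object_name, expected_size, pgid):
    """Build the truncate and repair command lines used above."""
    return [
        ["rados", "-p", pool, "truncate", object_name, str(expected_size)],
        ["ceph", "pg", "repair", pgid],
    ]


def run_fix(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(cmd, check=True)


# Values taken from this post: pool kube-ssd, object info size 3997696, pg 1.26d.
cmds = build_fix_commands(
    "kube-ssd", "rb.0.10a311.6b8b4567.00000000d200", 3997696, "1.26d"
)
run_fix(cmds)  # dry run: prints the two commands without executing them
```

Passing dry_run=False executes the commands in order, stopping at the first failure.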
  • Published 2018-03-20 17:33
  • Category: Ceph

Author: bruins
