Ceph PG stuck inactive for more than 300 seconds: Problem Analysis


Symptom

The cluster frequently reports HEALTH_ERR, and the Ceph monitor log contains entries like the following:


2018-03-23 02:57:26.044210 7f0b0f42c700  0 log_channel(cluster) log [INF] : HEALTH_ERR; 27 pgs are stuck inactive for more than 300 seconds; 1 pgs inconsistent; 69 pgs peering; 1 pgs repair; 79 pgs stale; 27 pgs stuck inactive; 27 pgs stuck unclean; 2254 scrub errors



Reproducing the Problem

Restart a single Ceph OSD and check the cluster state with ceph status; the cluster reports HEALTH_ERR:

# systemctl restart ceph-osd@0.service
# ceph -s
cluster f01fb68c-58c6-4707-8adb-b7ac88172340
 health HEALTH_ERR
        147 pgs are stuck inactive for more than 300 seconds
        189 pgs degraded
        15 pgs peering
        147 pgs stuck inactive
        147 pgs stuck unclean
        189 pgs undersized
        recovery 138680/24168142 objects degraded (0.574%)
 ...
        18740 active+clean
        142 undersized+degraded+peered
        47 active+undersized+degraded
        11 peering
        4 remapped+peering
 client io 392 MB/s rd, 166 MB/s wr, 30878 op/s rd, 1559 op/s wr

As shown above, the output contains the error "147 pgs are stuck inactive for more than 300 seconds", yet we have only just restarted one OSD; its PGs cannot actually have been inactive for more than 300 seconds.



Code Analysis

Tracing through the Ceph source, the "pgs are stuck inactive" message is generated in PGMonitor::get_health:


void PGMonitor::get_health(list<pair<health_status_t,string> >& summary,
                          list<pair<health_status_t,string> > *detail,
                          CephContext *cct) const
{
...
   utime_t cutoff = now - utime_t(g_conf->mon_pg_stuck_threshold, 0); // compute the cutoff timestamp
   uint64_t num_inactive_pgs = 0;
...
  } else {
       pg_map.get_stuck_counts(cutoff, note);  // collect per-state stuck counts from the PG map
       map<string,int>::const_iterator p = note.find("stuck inactive");
       if (p != note.end())
           num_inactive_pgs += p->second;
       p = note.find("stuck stale");
       if (p != note.end())
           num_inactive_pgs += p->second;
  }

   if (g_conf->mon_pg_min_inactive > 0 && num_inactive_pgs >= g_conf->mon_pg_min_inactive) {  // decide whether to report HEALTH_ERR
       ostringstream ss;
       ss << num_inactive_pgs << " pgs are stuck inactive for more than " << g_conf->mon_pg_stuck_threshold << " seconds";
       summary.push_back(make_pair(HEALTH_ERR, ss.str()));
  }
...
}

bool PGMap::get_stuck_counts(const utime_t cutoff, map<string, int>& note) const
{
   int inactive = 0;
   int unclean = 0;
   int degraded = 0;
   int undersized = 0;
   int stale = 0;

   for (ceph::unordered_map<pg_t, pg_stat_t>::const_iterator i = pg_stat.begin();
           i != pg_stat.end();
           ++i) { // iterate over every PG in the pg_map and tally by state
       if (! (i->second.state & PG_STATE_ACTIVE)) {
           if (i->second.last_active < cutoff)
               ++inactive;
      }
       if (! (i->second.state & PG_STATE_CLEAN)) {
           if (i->second.last_clean < cutoff)
               ++unclean;
      }
       if (i->second.state & PG_STATE_DEGRADED) {
           if (i->second.last_undegraded < cutoff)
               ++degraded;
      }
       if (i->second.state & PG_STATE_UNDERSIZED) {
           if (i->second.last_fullsized < cutoff)
               ++undersized;
      }
       if (i->second.state & PG_STATE_STALE) {
           if (i->second.last_unstale < cutoff)
               ++stale;
      }
  }

   if (inactive)
       note["stuck inactive"] = inactive;

   if (unclean)
       note["stuck unclean"] = unclean;

   if (undersized)
       note["stuck undersized"] = undersized;

   if (degraded)
       note["stuck degraded"] = degraded;

   if (stale)
       note["stuck stale"] = stale;

   return inactive || unclean || undersized || degraded || stale;
}

The relevant configuration options are:

OPTION(mon_pg_min_inactive, OPT_U64, 1)
OPTION(mon_pg_stuck_threshold, OPT_INT, 300)

From the code, when a PG is in a non-active state, the monitor compares the PG's last_active timestamp against the cutoff to decide whether the PG has been stuck long enough to report.
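To make the check concrete, here is a minimal standalone sketch of the same cutoff logic, using a hypothetical, much-reduced PgStat in place of Ceph's pg_stat_t and plain time_t in place of utime_t:

#include <cstdio>
#include <ctime>
#include <vector>

// Simplified stand-ins for Ceph's types -- illustrative only.
struct PgStat {
    bool   active;        // would be the PG_STATE_ACTIVE bit in Ceph
    time_t last_active;   // last time the PG was observed active
};

int main() {
    const int mon_pg_stuck_threshold = 300;        // same default as Ceph
    time_t now = time(nullptr);
    time_t cutoff = now - mon_pg_stuck_threshold;  // the cutoff described above

    std::vector<PgStat> pgs = {
        {false, now - 100},   // inactive for only 100s -> not yet stuck
        {false, now - 400},   // inactive past the cutoff -> stuck
        {true,  now - 400},   // active -> never counted
    };

    int stuck_inactive = 0;
    for (const auto& pg : pgs) {
        // Same rule as PGMap::get_stuck_counts(): a PG counts as
        // "stuck inactive" only if it is not active AND its
        // last_active stamp is older than the cutoff.
        if (!pg.active && pg.last_active < cutoff)
            ++stuck_inactive;
    }
    printf("%d pgs are stuck inactive for more than %d seconds\n",
           stuck_inactive, mon_pg_stuck_threshold);  // prints 1 here
    return 0;
}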

The question then becomes: when, and how, is a PG's last_active updated?



Inspecting Ceph PG State

Dump the state of all PGs to a file with the command ceph pg dump_json, then inspect the JSON file with jq:

# ceph pg dump_json > pg-dump-json
# cat pg-dump-json | jq .

Time the commands were run: Mon Mar 26 14:40:26 CST 2018

A sample PG from the JSON output is shown below. Its last_active stamp (13:52:56) is about 47 minutes older than the time the dump was taken (14:40:26), far exceeding the configured mon_pg_stuck_threshold (300 s):


  {
     "pgid": "3.6d",
     "version": "6979'1029",
     "reported_seq": "2413",
     "reported_epoch": "7114",
     "state": "active+clean",
     "last_fresh": "2018-03-26 13:52:56.629584",
     "last_change": "2018-03-26 13:38:28.263551",
     "last_active": "2018-03-26 13:52:56.629584",
     "last_peered": "2018-03-26 13:52:56.629584",
     "last_clean": "2018-03-26 13:52:56.629584",
     "last_became_active": "2018-03-26 13:38:28.263391",
     "last_became_peered": "2018-03-26 13:38:28.263391",
     "last_unstale": "2018-03-26 13:52:56.629584",
     "last_undegraded": "2018-03-26 13:52:56.629584",
     "last_fullsized": "2018-03-26 13:52:56.629584",
     "mapping_epoch": 7099,
     "log_start": "0'0",
     "ondisk_log_start": "0'0",
     "created": 3936,
     "last_epoch_clean": 7101,
     "parent": "0.0",
     "parent_split_bits": 0,
     "last_scrub": "6979'1029",
     "last_scrub_stamp": "2018-03-25 20:25:28.018835",
     "last_deep_scrub": "6813'1028",
     "last_deep_scrub_stamp": "2018-03-24 15:09:13.438038",
     "last_clean_scrub_stamp": "2018-03-25 20:25:28.018835",
...
  },
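As a quick sanity check on that gap, a few lines of C++ parse the two timestamps from the dump (the collection time 14:40:26 and the PG's last_active 13:52:56) and print the difference:

#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

// Parse "YYYY-MM-DD HH:MM:SS" into a time_t (local time).
static time_t parse(const char* s) {
    std::tm tm = {};
    std::istringstream in(s);
    in >> std::get_time(&tm, "%Y-%m-%d %H:%M:%S");
    return std::mktime(&tm);
}

int main() {
    time_t dump_time   = parse("2018-03-26 14:40:26");  // when the dump was taken
    time_t last_active = parse("2018-03-26 13:52:56");  // the PG's last_active
    long gap = static_cast<long>(std::difftime(dump_time, last_active));
    // Prints: gap = 2850s (~47 min), threshold = 300s
    std::cout << "gap = " << gap << "s (~" << gap / 60 << " min), "
              << "threshold = 300s" << std::endl;
    return 0;
}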

Looking at the code, a PG's stats are not updated on the monitor in real time; an OSD only publishes fresh stats (and with them a new last_active) periodically, governed by OPTION(osd_pg_stat_report_interval_max, OPT_INT, 500). Its default of 500 seconds is larger than OPTION(mon_pg_stuck_threshold, OPT_INT, 300), so the monitor can see a last_active stamp older than the threshold even though the PG is perfectly healthy.
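The resulting window can be sketched with a small illustrative simulation (not Ceph code): after an OSD's last stat report at t = 0, the monitor's copy of last_active simply ages, and between 300 s and 500 s a healthy PG can be flagged:

#include <cstdio>

int main() {
    const int mon_pg_stuck_threshold          = 300;  // monitor's stuck cutoff (s)
    const int osd_pg_stat_report_interval_max = 500;  // max gap between stat reports (s)

    // Walk the timeline since the last stat report and show when the
    // monitor starts flagging a healthy-but-silent PG as stuck.
    for (int t = 0; t <= osd_pg_stat_report_interval_max; t += 100) {
        bool flagged = t > mon_pg_stuck_threshold;  // last_active looks t seconds old
        printf("t=%3ds since last report: %s\n", t,
               flagged ? "counted as 'stuck inactive' (false positive)" : "ok");
    }
    return 0;
}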

References:

http://xiaqunfeng.cc/2017/08/01/ceph-peering%E8%BF%87%E7%A8%8B%E5%88%86%E6%9E%90/

http://tracker.ceph.com/issues/14028



Solution

Given the analysis above, if you do not want the cluster to report HEALTH_ERR when a Ceph OSD restarts, you can suppress this warning by setting mon_pg_min_inactive:

mon_pg_min_inactive = 0

For a related discussion of this issue, see https://www.spinics.net/lists/ceph-devel/msg28329.html
