Analyzing and resolving frequent PG error alerts in a Ceph cluster

Our Ceph production environment raises alerts every so often. They fall into two basic error types:


2018-03-15 06:57:09.994612 7f6b06fb7700  0 log_channel(cluster) log [INF] : HEALTH_ERR; 1 pgs inconsistent; 1 pgs repair; 59 scrub errors

2018-03-15 10:10:10.515985 7f6b06fb7700  0 log_channel(cluster) log [INF] : HEALTH_ERR; 123 pgs are stuck inactive for more than 300 seconds; 305 pgs degraded; 123 pgs stuck inactive; 127 pgs stuck unclean; 305 pgs undersized; recovery 289852/23564836 objects degraded (1.230%); 2/168 in osds are down

The cluster runs ceph version 11.2.0.


Searching the Ceph monitor log turns up the following:



# zgrep "reported failed by" /var/log/ceph/ceph-mon.ceph0.log-20180326.gz
2018-03-25 22:05:21.047780 7f0b0ec2b700 0 log_channel(cluster) log [DBG] : osd.146 10.34.57.42:6849/2558 reported failed by osd.32 10.34.57.27:6851/1163
2018-03-25 22:05:22.209535 7f0b0ec2b700 0 log_channel(cluster) log [DBG] : osd.146 10.34.57.42:6849/2558 reported failed by osd.19 10.34.57.26:6877/17321
2018-03-26 00:00:58.247810 7f0b0ec2b700 0 log_channel(cluster) log [DBG] : osd.157 10.34.57.42:6814/31504 reported failed by osd.19 10.34.57.26:6877/17321
2018-03-26 00:00:58.559369 7f0b0ec2b700 0 log_channel(cluster) log [DBG] : osd.157 10.34.57.42:6814/31504 reported failed by osd.95 10.34.57.34:6888/17517
...
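
To see at a glance which OSDs are reported as failed most often, the zgrep output above can be summarized per OSD. The following is a minimal sketch; the function name count_failed_reports is made up for illustration, and it assumes the log line layout shown above (the reported OSD id sits two fields before the word "reported").

```shell
# Count "reported failed by" entries per reported OSD.
# Reads monitor log lines on stdin and prints "<count> <osd>",
# sorted by count (descending), then by OSD name.
# count_failed_reports is a hypothetical helper name, not a Ceph tool.
count_failed_reports() {
    awk '/reported failed by/ {
             # Find the word "reported"; the OSD id (e.g. osd.146) is two
             # fields earlier, just before its address.
             for (i = 3; i < NF; i++)
                 if ($i == "reported" && $(i + 1) == "failed") {
                     n[$(i - 2)]++
                     break
                 }
         }
         END { for (osd in n) print n[osd], osd }' | sort -k1,1nr -k2,2
}
```

Example use: zgrep "reported failed by" /var/log/ceph/ceph-mon.ceph0.log-20180326.gz | count_failed_reports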


Searching the logs of the failed OSDs around the time of the failure reveals the following:


2018-03-26 02:57:07.198762 7f73968fb700 -1 Processor -- accept open file descriptions limit reached sd = 13 errno -23 (23) Too many open files in system
2018-03-26 02:57:08.199842 7f738d7a5700 -1 filestore(/var/lib/ceph/osd/ceph-157) error (23) Too many open files in system not handled on operation 0x7f73c1a490e0 (20737458.0.0, or op 0, counting from 0)
2018-03-26 02:57:08.199859 7f738d7a5700 0 filestore(/var/lib/ceph/osd/ceph-157) unexpected error code
2018-03-26 02:57:08.199860 7f738d7a5700 0 filestore(/var/lib/ceph/osd/ceph-157) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "touch",
            "collection": "1.ad6_head",
            "oid": "#1:6b59896c:::rb.0.17a787.6b8b4567.0000000009f6:head#"
        },
        {
            "op_num": 1,
            "op_name": "setattrs",
            "collection": "1.ad6_head",
            "oid": "#1:6b59896c:::rb.0.17a787.6b8b4567.0000000009f6:head#",
            "attr_lens": {
                "_": 288,
                "snapset": 31
            }
        },

...

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os/filestore/FileStore.cc: 3022: FAILED assert(0 == "unexpected error")

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f739c1ecb35]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x87f) [0x7f739be95acf]
3: (FileStore::_do_transactions(std::vector >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7f739be9c07b]
4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) [0x7f739be9c37d]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb59) [0x7f739c1f3949]
6: (ThreadPool::WorkThread::entry()+0x10) [0x7f739c1f4920]
7: (()+0x7e25) [0x7f739916be25]
8: (clone()+0x6d) [0x7f739805334d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The key error message is: accept open file descriptions limit reached sd = 13 errno -23 (23) Too many open files in system
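
To find out which OSDs on a node are affected, the OSD logs can be scanned for this message. A minimal sketch follows; enfile_hits is a hypothetical helper name, and the log location used in the usage note (/var/log/ceph/ceph-osd.*.log) is assumed to be the default and may differ on your nodes.

```shell
# List log files that contain the system-wide "too many open files" error,
# together with the number of matching lines, as "<file>:<count>".
# Arguments: one or more plain-text log files.
# enfile_hits is a hypothetical helper name, not a Ceph tool.
enfile_hits() {
    # -c counts matching lines, -H forces the file name even for a single
    # file; the second grep drops files without any match.
    grep -cH 'Too many open files in system' "$@" | grep -v ':0$'
}
```

Example use: enfile_hits /var/log/ceph/ceph-osd.*.log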

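So the OSD processes keep restarting because the Linux kernel's limit on open files has been reached. Errno 23 is ENFILE, the system-wide limit, not the per-process limit (EMFILE, errno 24). FileStore does not handle this error, so the OSD hits the assert shown above and aborts.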

The system-wide open file limit on the node is:


# cat /proc/sys/fs/file-max
655360
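
It also helps to compare the limit with the number of file handles currently allocated. On Linux, /proc/sys/fs/file-nr reports three numbers: allocated handles, allocated but unused handles, and the maximum (the same value as file-max). The sketch below prints the usage as a percentage; file_handle_usage is a hypothetical helper name.

```shell
# Print system-wide file handle usage as "<allocated>/<max> (<percent>%)".
# Optional argument: a file in /proc/sys/fs/file-nr format
# ("<allocated> <unused> <max>"); defaults to /proc/sys/fs/file-nr.
# file_handle_usage is a hypothetical helper name.
file_handle_usage() {
    awk '{ printf "%d/%d (%.1f%%)\n", $1, $3, 100 * $1 / $3 }' \
        "${1:-/proc/sys/fs/file-nr}"
}
```

Run file_handle_usage with no argument on an OSD node; a value close to 100% shortly before an OSD crash would support the diagnosis above.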


Solution

Edit the file /etc/sysctl.conf on the node.

Set: fs.file-max=6553600

Then run: sysctl -p
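
When the change has to be rolled out to many OSD nodes, it is convenient to make the edit idempotent, so that running it twice does not leave duplicate entries. This is a minimal sketch under that assumption; set_file_max is a hypothetical helper name, and it must be run as root when it targets /etc/sysctl.conf.

```shell
# Set fs.file-max in a sysctl configuration file: replace an existing entry
# or append a new one. Arguments: <value> [conf-file].
# set_file_max is a hypothetical helper name; the conf file defaults to
# /etc/sysctl.conf as used above.
set_file_max() {
    value=$1
    conf=${2:-/etc/sysctl.conf}
    if grep -Eq '^[[:space:]]*fs\.file-max[[:space:]]*=' "$conf" 2>/dev/null; then
        # Rewrite the existing line through a temporary copy, then write the
        # result back so the original file keeps its owner and permissions.
        sed "s/^[[:space:]]*fs\.file-max[[:space:]]*=.*/fs.file-max = $value/" \
            "$conf" > "$conf.tmp" &&
            cat "$conf.tmp" > "$conf" &&
            rm -f "$conf.tmp"
    else
        printf 'fs.file-max = %s\n' "$value" >> "$conf"
    fi
}
```

Example use: set_file_max 6553600 && sysctl -p, then confirm the new value with cat /proc/sys/fs/file-max.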

  • Posted on 2018-04-03 19:05
  • Category: Ceph

  • Author: ictfox
