又一例TRIM导致asm磁盘数据丢失的故障

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:又一例TRIM导致asm磁盘数据丢失的故障

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

以前遇到过一个case,存储直连虚拟机,对磁盘误操作之后触发trim,导致数据被清空:ssd trim导致fdisk格式化磁盘之后无法恢复,最近再次遇到类似案例:客户错误对一块asm disk磁盘进行了格式化
mkfs


该磁盘是由6块磁盘组成了磁盘组
data

被格式化之后data磁盘组直接dismount

Tue Apr 07 18:22:31 2026
WARNING: cache read  a corrupt block: group=2(DATA) fn=261 indblk=0 disk=0 (DATA_0000) incarn=3958745085 au=605 blk=0 count=1
Errors in file /home/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_639087.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
NOTE: a corrupted block from group DATA was dumped to /home/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_639087.trc
WARNING: cache read (retry) a corrupt block: group=2(DATA) fn=261 indblk=0 disk=0 (DATA_0000) incarn=3958745085 au=605 blk=0 count=1
Errors in file /home/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_639087.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
ERROR: cache failed to read group=2(DATA) fn=261 indblk=0 from disk(s): 0(DATA_0000)
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
NOTE: cache initiating offline of disk 0 group DATA
NOTE: process _user639087_+asm1 (639087) initiating offline of disk 0.3958745085 (DATA_0000) with mask 0x7e in group 2
NOTE: initiating PST update: grp = 2, dsk = 0/0xebf5a7fd, mask = 0x6a, op = clear
Tue Apr 07 18:22:31 2026
GMON updating disk modes for group 2 at 10 for pid 28, osid 639087
ERROR: Disk 0 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 2)
Tue Apr 07 18:22:31 2026
NOTE: cache dismounting (not clean) group 2/0xE9E5571F (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 115720, image: oracle@ajjorcl1 (B000)
Tue Apr 07 18:22:31 2026
NOTE: halting all I/Os to diskgroup 2 (DATA)
WARNING: Offline for disk DATA_0000 in mode 0x7f failed.
Tue Apr 07 18:22:31 2026
NOTE: LGWR doing non-clean dismount of group 2 (DATA)
NOTE: LGWR sync ABA=15.1625 last written ABA 15.1625
Errors in file /home/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_639087.trc  (incident=309345):
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0000" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [261] [2147483648] [0 != 1]
Incident details in: /home/app/grid/diag/asm/+asm/+ASM1/incident/incdir_309345/+ASM1_ora_639087_i309345.trc
Tue Apr 07 18:22:31 2026
List of instances:
 1
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 30)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE 
 26 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Tue Apr 07 18:22:31 2026
freeing rdom 2
Tue Apr 07 18:22:31 2026
WARNING: dirty detached from domain 2
NOTE: cache dismounted group 2/0xE9E5571F (DATA) 
SQL> alter diskgroup DATA dismount force /* ASM SERVER:3924121375 */ 
Tue Apr 07 18:22:32 2026
Sweep [inc][309345]: completed
System State dumped to trace file /home/app/grid/diag/asm/+asm/+ASM1/incident/incdir_309345/+ASM1_ora_639087_i309345.trc
Tue Apr 07 18:22:32 2026
Dumping diagnostic data in directory=[cdmp_20260407182232], requested by (instance=1, osid=639087), summary=[incident=309345].
Tue Apr 07 18:22:32 2026
NOTE: cache deleting context for group DATA 2/0xe9e5571f
GMON dismounting group 2 at 11 for pid 32, osid 115720
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0001 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0002 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0003 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0004 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0005 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2
Tue Apr 07 18:22:34 2026
Sweep [inc2][309345]: completed
NOTE: AMDU dump of disk group DATA created at /home/app/grid/diag/asm/+asm/+ASM1/incident/incdir_309345
Tue Apr 07 18:22:37 2026
NOTE: ASM client orcl1:orcl disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /home/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_504268.trc
Tue Apr 07 18:23:02 2026
SUCCESS: diskgroup DATA was dismounted
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:3924121375 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA

通过kfed分析被格式化的磁盘,随机找了一些au发现都被置空
kfed


使用lsblk查看对应磁盘是否启用了TRIM 特性
trim

基于这样的情况,基本上可以判断,该磁盘大概率已经触发了trim,数据被置空的概率非常大,最后对于镜像磁盘通过winhex查看,确认磁盘中除了基本的分区和文件系统信息之外其他都为空
kong
基于此种情况,最好的结果就是恢复该6个磁盘组磁盘中5个磁盘的数据,这样丢失数据最少1/6以上,但是也是没有办法中的办法,尽可能减少损失了.

oracle linux 8.10注意pmlogger导致空间被大量占用

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:oracle linux 8.10注意pmlogger导致空间被大量占用

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

最近在多个客户的oracle linux 8.10的机器上发现/var/log/pcp/pmlogger目录占用空间过大,比如

[root@oracledb lib]# df -h
文件系统             容量  已用  可用 已用% 挂载点
devtmpfs              16G     0   16G    0% /dev
tmpfs                 16G     0   16G    0% /dev/shm
tmpfs                 16G  202M   16G    2% /run
tmpfs                 16G     0   16G    0% /sys/fs/cgroup
/dev/mapper/ol-root   60G   60G   63M  100% /
/dev/mapper/ol-home   30G  241M   29G    1% /home
/dev/vda1           1014M  360M  655M   36% /boot
/dev/vdb1            688G  224G  430G   35% /u01
tmpfs                3.1G   12K  3.1G    1% /run/user/42
tmpfs                 60M     0   60M    0% /var/log/rtlog
tmpfs                3.1G  8.0K  3.1G    1% /run/user/0
[root@oracledb ~]# cd /var/log/pcp/
[root@oracledb pcp]# du -h -d 1
76K	./pmcd
0	./pmfind
476K	./pmie
34G	./pmlogger
0	./pmproxy
0	./sa
34G	.
[root@oracledb pcp]# cd pmlogger/
[root@oracledb pmlogger]# du -h -d 1
34G	./localhost.localdomain
108K	./oracledb
34G	.

目录里面具体类容类似:
logpcp


操作系统环境为:

[root@oracledb localhost.localdomain]# uname -a
Linux oracledb 5.15.0-206.153.7.1.el8uek.x86_64 #2 SMP Wed May 22 20:49:34 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@oracledb localhost.localdomain]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.10 (Ootpa)
[root@oracledb localhost.localdomain]# cat /etc/os-release 
NAME="Oracle Linux Server"
VERSION="8.10"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.10"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:10:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.10
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.10

pmlogger 是 Performance Co-Pilot(PCP)的核心组件,用于采集、归档系统 / 应用性能指标,生成可回放的 PCP 归档日志,支撑离线追溯分析与长期性能基线管理。它通过 PMCD(Performance Metrics Collector Daemon)获取指标,按配置策略记录并维护多卷归档,配合 PCP 工具链实现全链路性能分析。
这个本身是一个系统性能指标采集监控的重要工具,但是由于某种原因导致占用磁盘空间过大,而且大部分情况下有另外的监控来取代,引起大部分情况下,可以考虑关闭相关服务,并清空相关日志,避免出现磁盘空间不足的情况

[root@oracledb ~]# systemctl status pmlogger
● pmlogger.service - Performance Metrics Archive Logger
   Loaded: loaded (/usr/lib/systemd/system/pmlogger.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2025-11-03 22:25:48 CST; 1 months 27 days ago
     Docs: man:pmlogger(1)
 Main PID: 5580 (pmlogger)
    Tasks: 1 (limit: 818103)
   Memory: 27.9G
   CGroup: /system.slice/pmlogger.service
           └─5580 /usr/libexec/pcp/bin/pmlogger -N -P -r -T24h10m -c config.default -v 100mb -mreexec %Y%m%d.%H.%M

Nov 03 22:25:46 localhost.localdomain systemd[1]: Starting Performance Metrics Archive Logger...
Nov 03 22:25:48 localhost.localdomain systemd[1]: Started Performance Metrics Archive Logger.
[root@oracledb ~]# systemctl stop pmlogger
[root@oracledb ~]# cd /var
[root@oracledb var]# rm -rf /var/log/pcp/pmlogger/*
[root@oracledb var]# du -sh
665M	.
[root@oracledb var]# systemctl disable pmlogger
Removed /etc/systemd/system/multi-user.target.wants/pmlogger.service.

磁盘空间恢复正常

[root@oracledb var]# df -h
文件系统             容量  已用  可用 已用% 挂载点
devtmpfs              16G     0   16G    0% /dev
tmpfs                 16G     0   16G    0% /dev/shm
tmpfs                 16G  202M   16G    2% /run
tmpfs                 16G     0   16G    0% /sys/fs/cgroup
/dev/mapper/ol-root   60G   27G   34G   44% /
/dev/mapper/ol-home   30G  241M   29G    1% /home
/dev/vda1           1014M  360M  655M   36% /boot
/dev/vdb1            688G  224G  430G   35% /u01
tmpfs                3.1G   12K  3.1G    1% /run/user/42
tmpfs                 60M     0   60M    0% /var/log/rtlog
tmpfs                3.1G  8.0K  3.1G    1% /run/user/0
[root@oracledb var]# 

linux异常磁盘lvm恢复操作演示

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:linux异常磁盘lvm恢复操作演示

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

有客户exsi系统被勒索病毒加密,拷贝出来磁盘文件,通过工具分析,磁盘文件均有部分被破坏,恢复工具无法自动识别
QQ20251206-001609


无法扫描到任何分区信息
QQ20251206-001942

和客户沟通确认三快盘采用的是lvm方式管理,尝试检索lvm信息
lvm1
lvm2

检索lv信息
lvm3

选择合适的lv,进行读取
lvm4

人工选择合适的磁盘作为pv
lvm5

直接查看lv中数据
lvm6

arm环境vg损坏mysql数据库恢复

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:arm环境vg损坏mysql数据库恢复

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

国庆节期间接到朋友咨询,原先在vg中的磁盘被重新pvcreate了,想恢复原磁盘中的mysql数据库
pvcreate


通过分析系统的history日志,发现操作不是简单的pvcreate,我简单梳理下操作步骤
故障之前磁盘情况

[root@0002 ~]# lsblk
NAME                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0                    11:0    1 1024M  0 rom  
vda                   253:0    0  200G  0 disk 
├─vda1                253:1    0  600M  0 part /boot/efi
├─vda2                253:2    0    1G  0 part /boot
└─vda3                253:3    0 38.4G  0 part 
  ├─klas-root         252:0    0 34.4G  0 lvm  /
  └─klas-swap         252:1    0    4G  0 lvm  [SWAP]
vdb                   253:16   0 1000G  0 disk 
└─vdb1                253:17   0  500G  0 part 
  └─mysql-mysql--mycg 252:2    0  500G  0 lvm  /mysql

这里可以看到出来vdb磁盘一共1000G,分区vdb1 为500G,然后这500G加入到vg中并分配了lv.

vdb磁盘现状

[root@0002 mysql]# lsblk /dev/vdb
NAME                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdb                   253:16   0 1000G  0 disk 
└─vdb1                253:17   0 1000G  0 part 

Disk /dev/vdb: 1000 GiB, 1073741824000 bytes, 2097152000 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x5a6aaeee

Device     Boot Start        End    Sectors  Size Id Type
/dev/vdb1        2048 2097151999 2097149952 1000G 8e Linux LVM

这里基本上可以确定,vdb1磁盘分区从以前的500G变成了1000G(也就是说被重新分区了,后续和现场沟通确认进行了重新分区操作)


通过history日志追述大概的操作过程

  898  [2025-09-28 11:55:13][root]fdisk -l
  899  [2025-09-28 11:55:21][root]df -h
  900  [2025-09-28 11:56:41][root]lsblk
  901  [2025-09-28 11:59:44][root]fdisk /dev/vdb
  902  [2025-09-28 12:00:46][root]partprobe /dev/vdb
  903  [2025-09-28 12:00:50][root]pvresize /dev/vdb1
  904  [2025-09-28 12:00:56][root]df -h
  905  [2025-09-28 12:01:25][root]vgdisplay mysql
  906  [2025-09-28 12:01:40][root]lsblk
  907  [2025-09-28 12:02:05][root]sudo partprobe /dev/vdb
  908  [2025-09-28 12:02:10][root]pvresize /dev/vdb1
  909  [2025-09-28 12:02:27][root]sudo pvresize /dev/vdb1
  910  [2025-09-28 12:03:07][root]sudo pvcreate /dev/vdb1
  911  [2025-09-28 12:03:22][root]sudo pvscan
  912  [2025-09-28 12:03:30][root]sudo pvdisplay
  913  [2025-09-28 12:05:37][root]parted /dev/vdb
  914  [2025-09-28 12:06:11][root]pvresize /dev/vdb1
  915  [2025-09-28 12:06:15][root]lsblk
  916  [2025-09-28 12:09:48][root]lvextend -l +100%FREE /dev/mysql/mysql--mycg
  917  [2025-09-28 12:10:00][root]cd /dev/mysql/
  918  [2025-09-28 12:10:01][root]ll
  919  [2025-09-28 12:10:20][root]pwd
  920  [2025-09-28 12:10:32][root]lvextend -l +100%FREE /dev/mysql/mysql-mycg
  921  [2025-09-28 12:10:55][root]lsblk /dev/vdb

基本上可以确定9月28日先进行了fdisk分区操作,然后尝试pvresize 操作[应该不会成功,因为重新分区导致pv信息丢失],然后进行了pvcreate之后再次进行parted分区操作,再pvresize,lvextend操作[同理pv信息丢失应该不会成功],然后10月5日继续进行的部分操作

  956  [2025-10-05 08:29:27][root]umount /mysql
  957  [2025-10-05 08:29:38][root]lsof /mysql
  958  [2025-10-05 08:29:58][root]service mysqld stop
  959  [2025-10-05 08:30:02][root]umount /mysql
  960  [2025-10-05 08:30:05][root]lsof /mysql
  961  [2025-10-05 08:30:23][root]cd /
  962  [2025-10-05 08:30:25][root]umount /mysql
  963  [2025-10-05 08:30:34][root]pvcreate --force /dev/vdb1
  964  [2025-10-05 08:30:47][root]vgextend mysql /dev/vdb1
  965  [2025-10-05 08:31:02][root]df -h
  966  [2025-10-05 08:31:33][root]pvdisplay /dev/vdb1
  967  [2025-10-05 08:31:41][root]pvcreate --force /dev/vdb1
  968  [2025-10-05 08:32:11][root]lvs | grep mysql-mysql--mycg
  969  [2025-10-05 08:32:19][root]dmsetup ls | grep mysql
  970  [2025-10-05 08:32:38][root]fuser /dev/vdb1
  971  [2025-10-05 08:32:41][root]lsof /dev/vdb1
  972  [2025-10-05 08:32:50][root]pvcreate --force /dev/vdb1
  973  [2025-10-05 08:33:14][root]reboot
  974  [2025-10-05 08:36:23][root]pvcreate --force /dev/vdb1
  975  [2025-10-05 08:36:47][root]lvdisplay /dev/mapper/mysql-mysql--mycg
  976  [2025-10-05 08:36:53][root]vgextend mysql /dev/vdb1
  977  [2025-10-05 08:37:10][root]lvextend -l +100%FREE /dev/mysql/mysql--mycg

初步看,应该是先尝试umount /dev/vdb1,但是没有成功,然后直接reboot重启了主机,起来之后,进行了pvcreate[操作成功],vgextend,lvextend等操作[失败,因为vg里面的之前的pv信息已经丢失],而且之前lv无法mount成功,数据库文件/备份均在这个lv里面,而且从库很久之前没有正常同步.基于这样的情况,就一定要对vdb磁盘中数据进行恢复.查看操作系统信息,确认是arm系统
arm


由于arm系统一般工具均无法正常解析,只能让客户把磁盘挂载到x86环境进行处理,通过专业恢复工具解析,运气不错可以直接读取数据
m1

传输数据到客户服务器中,并成功启动mysql,客户测试业务没有任何问题,数据完整恢复
2

redhat系列7/8进入单用户模式

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:redhat系列7/8进入单用户模式

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

以前写过一篇文章在linux老版本中,进入单用户模式的方法:linux 4/5/6版本进入单用户模式,今天测试主流的redhat系列(测试使用OEL,没有本质区别)7和8版本中进入单用户.
主要操作步骤:
1)选择linux启动项,输入e
2)根据你的RHEL/CentOS/OEL版本,找到 linux16/linux/linuxefi等类似启动语句语句,按下键盘上的 End 键,跳到行末,添加关键词 rd.break,按下 Ctrl+x 或 F10 来进入单用户模式
3)mount 根文件系统为读写模式:mount -o remount,rw /sysroot
4)指定/sysroot为/挂载点:chroot /sysroot
5)进行需要的系统操作,比如重设root密码,修改不合适的系统配置(fstab,sysctl.conf等),然后sync同步数据
6)重启系统:reboot -f(也可以两次exit实现重启)
linux 7系列进入单用户演示
s1
s2
s3


linux 8系列进入单用户演示
s4
s5
s6