I had seen this warning before, but it went away before I could take a good look. I was going to look at it again later when I got an alert from Log Insight that my retention period was down to 4 days!
It used to be between 20 and 30 days, so what gives?
I looked at the System Monitor page and saw a huge number of dropped events.
I wasn’t sure what the Events Per Second figure was supposed to be, so it didn’t alarm me at first, but it was in fact quite high (20k vs. the usual 1.5k).
In Interactive Analytics I didn’t even have to run a search to see that one specific set of messages was coming in constantly.
There is a KB that matched these messages closely: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033725
The fix is to clear the IPMI System Event Log (SEL) on each affected host:
localcli hardware ipmi sel clear
A good way to find the exact hosts that are causing the problem is to go to Interactive Analytics:
Search string: ipmiifcselreadentry
For the graph, choose “count of events grouped by host”
It’s helpful to use a non-time-series chart in this case so that you can see the cumulative event count for the whole time period rather than per minute, per 10 minutes, etc.
In this graph, over 6 hours each of the hosts pumped 4-5 million of these events!
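If you'd rather do the same "count of events grouped by host" tally outside of Log Insight, here is a minimal Python sketch. It assumes you have the events available as (hostname, message) pairs; that input format, the function name, and the sample messages are all made up for illustration and are not a Log Insight export format.

```python
from collections import Counter

# Same search string used in Interactive Analytics above.
SEARCH_TERM = "ipmiifcselreadentry"


def count_events_by_host(events):
    """Count matching events per host.

    `events` is an iterable of (hostname, message) pairs. This input
    shape is an assumption for the sketch, not a Log Insight format.
    """
    counts = Counter()
    for host, message in events:
        # Case-insensitive substring match on the message text.
        if SEARCH_TERM in message.lower():
            counts[host] += 1
    return counts


# Made-up sample data, only to show the shape of the input.
sample = [
    ("esx01", "placeholder text IpmiIfcSelReadEntry placeholder text"),
    ("esx01", "placeholder text ipmiifcselreadentry placeholder text"),
    ("esx02", "placeholder text IpmiIfcSelReadEntry placeholder text"),
    ("esx03", "some unrelated message"),
]

for host, count in count_events_by_host(sample).most_common():
    print(host, count)
```

Sorting by count (`most_common()`) gives the same view as the cumulative chart: the noisiest hosts come first.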
Though there was already an alert for Log Insight dipping below my retention period, I decided to add an alert that looks for this exact issue and explains how to resolve it.
Group by hostname
Threshold: 100000 events in a single group in the last 15 minutes
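As a quick sanity check on that threshold, here is a small Python sketch. It spreads the 4-5 million events per host over 6 hours evenly across 15-minute windows and compares the result to the 100000 limit. The function name is mine, the even spread is a simplification, and I am assuming the alert fires on "more than" the threshold.

```python
THRESHOLD = 100_000    # events in a single group (host)
WINDOW_MINUTES = 15    # alert window


def events_per_window(total_events, period_hours, window_minutes=WINDOW_MINUTES):
    """Average events per alert window, assuming an even event rate."""
    windows = period_hours * 60 / window_minutes
    return total_events / windows


def hosts_over_threshold(counts_by_host, threshold=THRESHOLD):
    """Return the hosts whose count in one window exceeds the threshold.

    Assumes "more than threshold" semantics for the alert.
    """
    return sorted(h for h, c in counts_by_host.items() if c > threshold)


# 4-5 million events per host over 6 hours, as seen in the graph.
low = events_per_window(4_000_000, 6)
high = events_per_window(5_000_000, 6)
print(round(low), round(high))

# Hypothetical host names with those per-window rates, plus a quiet host.
print(hosts_over_threshold({"esx01": low, "esx02": high, "esx03": 1_200}))
```

At those rates each noisy host averages well over 100000 events per 15 minutes, so the alert would catch them while leaving normally behaving hosts alone.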