Log Insight – No Events Have Been Received Recently

We were at the tail end of our firmware and ESXi upgrades when I noticed that Log Insight was showing a strange warning on the Interactive Analytics page: “No events have been received recently.”

I had seen this warning before, but it went away before I could take a good look. I was planning to come back to it later, until Log Insight alerted me that my retention period was down to 4 days!
It used to be 20-30 days, so what gives?

I checked the System Monitor page and saw a huge number of dropped events.

I wasn’t sure what the Events Per Second figure was supposed to be, so at first it didn’t alarm me, but it was in fact very high (about 20k versus a normal 1.5k).

In Interactive Analytics I didn’t even have to enter a search to see that one specific set of messages was coming in constantly.

There is a VMware KB article that closely matches these messages: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033725

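The fix, on each affected host, is to:
1. Enable SSH and log in to the host.
2. Clear the IPMI System Event Log (SEL):
localcli hardware ipmi sel clear
3. Restart the CIM broker watchdog:
/etc/init.d/sfcbd-watchdog restart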
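If several hosts need the fix, the two commands can be pushed over SSH from a workstation. This is a minimal sketch, not part of the original procedure: it assumes SSH is already enabled on every host and that root login over SSH is allowed. The script name and host names in the usage line are hypothetical. With DRY_RUN=1 it only prints what it would run.

```shell
#!/bin/sh
# Sketch: apply the SEL fix from the post to one or more ESXi hosts over SSH.
# Assumptions (not from the original post): SSH is already enabled on each
# host and root login over SSH is allowed. Set DRY_RUN=1 to print the
# commands instead of executing them.

# The two commands from the fix; the watchdog restart only runs if the
# SEL clear succeeded.
FIX_CMDS='localcli hardware ipmi sel clear && /etc/init.d/sfcbd-watchdog restart'

fix_host() {
    host="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the exact ssh invocation for this host.
        printf "ssh root@%s '%s'\n" "$host" "$FIX_CMDS"
    else
        ssh "root@$host" "$FIX_CMDS"
    fi
}

# Apply the fix to every host name given on the command line.
for h in "$@"; do
    fix_host "$h"
done
```

For example, `DRY_RUN=1 sh fix-sel.sh esx01.example.com esx02.example.com` lists the commands for two hosts without touching them. Chaining with `&&` is a design choice: if clearing the SEL fails, the watchdog is not restarted, so the failure stays visible.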

A good way to find exactly which hosts are causing the problem is to use Interactive Analytics:

Search string: ipmiifcselreadentry

For the graph, choose “count of events grouped by host”
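If you have the hosts’ raw syslog exported to a file, the same “count of events grouped by host” can be approximated on the command line. This is a rough sketch under an assumption about the log layout: it expects the host name in the second whitespace-separated field (as in `<timestamp> <host> <message>`), and HOST_FIELD can be set if your format differs. The search string is matched case-insensitively.

```shell
#!/bin/sh
# Sketch: count ipmiifcselreadentry events per host in an exported syslog file,
# similar to a non-time-series "count of events grouped by host" chart.
# Assumption: the host name is field HOST_FIELD (default 2) of each line.
count_by_host() {
    grep -i 'ipmiifcselreadentry' "$1" |
        awk -v f="${HOST_FIELD:-2}" '{ n[$f]++ } END { for (h in n) print n[h], h }' |
        sort -rn   # busiest hosts first
}

# When run as a script with a file argument, print the per-host counts.
if [ $# -gt 0 ]; then
    count_by_host "$1"
fi
```

The output is one `<count> <host>` line per host, largest first, which mirrors reading the cumulative bars in the chart.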

It’s helpful to use a non-time-series chart here so that you see the cumulative event count for the whole time range rather than per minute, per 10 minutes, and so on.
In this graph, each of the hosts pumped out 4-5 million of these events over 6 hours!
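To put those totals in per-second terms (6 hours is 21,600 seconds), each host was logging roughly 185 to 231 of these events every second:

```latex
\frac{4\times 10^{6}\ \text{events}}{6 \times 3600\ \text{s}} \approx 185\ \text{events/s},
\qquad
\frac{5\times 10^{6}\ \text{events}}{21600\ \text{s}} \approx 231\ \text{events/s}
```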

Although there was already an alert for Log Insight’s retention dropping below my target period, I decided to add an alert that looks for this exact issue and explains how to resolve it.

Query: ipmiifcselreadentry
Group by: hostname
Threshold: 100000 events in a single group in the last 15 minutes
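The alert’s logic (count per host over a 15-minute window, then compare against a threshold) can be sketched in shell to sanity-check the numbers offline. For scale, 100,000 events in 15 minutes (900 s) is about 111 events per second from a single host, well below the 4-5 million per 6 hours seen in the graph. The input format (one event per line as `<epoch-seconds> <host>`), the function name, and the strict “more than” comparison are assumptions for illustration, not how Log Insight evaluates alerts internally.

```shell
#!/bin/sh
# Sketch of the alert rule: group events by host, keep only those in the last
# WINDOW seconds (default 900 = 15 minutes), and print each host whose count
# is more than THRESHOLD (default 100000).
# Assumed input on stdin (hypothetical format): "<epoch-seconds> <host>" per line.
hosts_over_threshold() {
    now="$1"  # evaluation time, in epoch seconds
    awk -v now="$now" -v win="${WINDOW:-900}" -v t="${THRESHOLD:-100000}" '
        $1 > now - win && $1 <= now { n[$2]++ }
        END { for (h in n) if (n[h] > t) print h }
    ' | sort
}
```

A typical call would pipe such a file in and pass the current time, e.g. `hosts_over_threshold "$(date +%s)" < events.txt` (file name hypothetical).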
