# check_and_reboot
An automated network card recovery tool written in Julia. It monitors network connectivity and reboots the system to recover from a hung or stuck network card.
## Overview
This project consists of two monitoring scripts that detect network failures and automatically reboot the system to recover the network card:
- **[`check_router_reboot.jl`](check_router_reboot.jl)** - Monitors router connectivity via ICMP ping to detect when the network stack becomes unresponsive
- **[`check_yiem_website_reboot.jl`](check_yiem_website_reboot.jl)** - Monitors website availability via HTTP requests as an additional network health indicator
Both scripts run continuously in the background, performing periodic health checks and automatically rebooting the system once the number of consecutive failures reaches the configured threshold (`FAILS_TO_REBOOT`). The reboot resets the network card and restores network connectivity.
## Features
- **Continuous Monitoring**: Runs indefinitely with configurable check intervals
- **Multi-attempt Verification**: Retries failed checks with backoff before declaring failure
- **State Persistence**: Maintains state in a JSON file across restarts
- **Cooldown Period**: Prevents rapid repeated reboots after a reboot event
- **Cross-Platform Support**: Works on Linux, macOS, and Windows with appropriate reboot commands
- **Broadcast Notifications**: Sends system-wide notifications on events (via `wall`, `logger`, or platform equivalents)
- **Log Rotation**: Automatically limits each log file to the last 100 entries to prevent unbounded growth
- **Dry Run Mode**: Test configuration without triggering actual reboots
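
The log rotation feature could be implemented roughly as follows. This is a minimal sketch, not the scripts' actual code: the helper name `append_log` and the `max_entries` keyword are hypothetical, and it assumes one log entry per line.

```julia
using Dates

# Hypothetical helper: append a timestamped entry, then keep only the newest
# `max_entries` lines. Assumes one entry per line.
function append_log(path::AbstractString, msg::AbstractString; max_entries::Int = 100)
    lines = isfile(path) ? readlines(path) : String[]
    push!(lines, "[$(now())] $msg")
    if length(lines) > max_entries
        lines = lines[end-max_entries+1:end]   # drop the oldest entries
    end
    open(path, "w") do io
        foreach(line -> println(io, line), lines)
    end
    return length(lines)   # number of entries now in the file
end
```

Rewriting the whole file on every entry is simple and cheap at 100 lines; a larger limit might call for a different approach.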
## Configuration
### Router Monitor Configuration (`check_router_reboot.jl`)
```julia
const ROUTER_IP = "192.168.88.1" # Target router IP address
const TIMEOUT_SECS = 30 # Request timeout in seconds
const ATTEMPTS_PER_CHECK = 1 # Number of ping attempts per check
const BACKOFF_BETWEEN_ATTEMPTS = 1 # Seconds between retry attempts
const FAILS_TO_REBOOT = 3 # Consecutive failures before reboot
const COOLDOWN_AFTER_REBOOT_SECS = 600 # Minimum seconds between reboots
const DRY_RUN = false # true = log only (no reboot); false = perform actual reboots
const CHECK_INTERVAL_SECS = 60 # Check interval in seconds
```
### Website Monitor Configuration (`check_yiem_website_reboot.jl`)
```julia
const URL = "https://www.yiem.cc" # Target URL to monitor
const TIMEOUT_SECS = 30 # Request timeout in seconds
const ATTEMPTS_PER_CHECK = 3 # Number of HTTP attempts per check
const BACKOFF_BETWEEN_ATTEMPTS = 60 # Seconds between retry attempts
const FAILS_TO_REBOOT = 3 # Consecutive failures before reboot
const COOLDOWN_AFTER_REBOOT_SECS = 600 # Minimum seconds between reboots
const DRY_RUN = false # true = log only (no reboot); false = perform actual reboots
const CHECK_INTERVAL_SECS = 60 # Check interval in seconds
```
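
To illustrate how `ATTEMPTS_PER_CHECK` and `BACKOFF_BETWEEN_ATTEMPTS` might interact, here is a minimal sketch of a retry loop. The helper name `check_with_retries` and the `probe` argument (any zero-argument function returning `true` on success) are assumptions made for illustration; the scripts' real ping and HTTP logic is not shown in this README.

```julia
# Hypothetical helper: call `probe()` up to `attempts` times, sleeping
# `backoff_secs` between attempts. Returns true on the first success.
function check_with_retries(probe::Function; attempts::Int = 3, backoff_secs::Real = 60)
    for attempt in 1:attempts
        ok = try
            probe() === true
        catch
            false   # treat any exception (timeout, DNS error, ...) as a failed attempt
        end
        ok && return true
        attempt < attempts && sleep(backoff_secs)   # no sleep after the final attempt
    end
    return false
end
```

Under this sketch, and assuming each attempt honors `TIMEOUT_SECS`, a fully failing website check with the values above could take up to about 3 × 30 + 2 × 60 = 210 seconds before the `CHECK_INTERVAL_SECS` wait begins.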
## Usage
### Running Manually
```bash
julia check_router_reboot.jl
julia check_yiem_website_reboot.jl
```
### Running at System Boot (Crontab)
Add the following to root's crontab (`sudo crontab -e`):
```
@reboot /usr/local/bin/juliar /path/to/check_router_reboot.jl >> /var/log/check_reboot.log 2>&1
@reboot /usr/local/bin/juliar /path/to/check_yiem_website_reboot.jl >> /var/log/check_reboot.log 2>&1
```
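**Note**: The crontab entries above invoke `juliar`, a symlink to the Julia installation used by root (separate from the regular user's Julia installation). Replace `/path/to/` with the directory that contains the scripts.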
### Required Dependencies
Install the required Julia packages (`Dates` is part of Julia's standard library, so listing it is harmless but optional):
```bash
julia -e 'using Pkg; Pkg.add(["HTTP", "Dates", "JSON"])'
```
## Files
| File | Description |
|------|-------------|
| [`check_router_reboot.jl`](check_router_reboot.jl) | Router ping monitor with auto-reboot |
| [`check_yiem_website_reboot.jl`](check_yiem_website_reboot.jl) | Website HTTP monitor with auto-reboot |
| [`check_and_reboot_state.json`](check_and_reboot_state.json) | State persistence file (generated) |
| [`check_router_reboot_log.txt`](check_router_reboot_log.txt) | Router monitor log file |
| [`check_website_reboot_log.txt`](check_website_reboot_log.txt) | Website monitor log file |
## State File
The state is stored in [`check_and_reboot_state.json`](check_and_reboot_state.json) with the following structure:
```json
{
"consecutive_fails": 0,
"last_reboot_datetime": "2026-03-11T10:00:00"
}
```
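
Once the file has been parsed by a JSON library, the two fields could be mapped to typed values as in the sketch below. The names `RebootState`, `state_from_dict`, and `state_to_dict` are hypothetical, and treating a missing or null `last_reboot_datetime` as "never rebooted" is an assumption; only the `Dates` standard library is used here.

```julia
using Dates

# Hypothetical typed view of the state file's two fields.
struct RebootState
    consecutive_fails::Int
    last_reboot::Union{DateTime,Nothing}
end

# Build a RebootState from a parsed JSON object (a Dict-like value).
function state_from_dict(d::AbstractDict)
    raw = get(d, "last_reboot_datetime", nothing)
    # DateTime(::String) parses ISO 8601 text such as "2026-03-11T10:00:00".
    last = raw === nothing ? nothing : DateTime(raw)
    return RebootState(Int(get(d, "consecutive_fails", 0)), last)
end

# Convert back to a plain Dict that a JSON writer can serialize.
function state_to_dict(s::RebootState)
    return Dict{String,Any}(
        "consecutive_fails" => s.consecutive_fails,
        "last_reboot_datetime" => s.last_reboot === nothing ? nothing : string(s.last_reboot),
    )
end
```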
## Log Output
Logs are written to both the console and the respective log files with timestamps:
```
[2026-03-11T10:05:09.123] Starting check loop. Checking router 192.168.88.1 every 60 seconds.
[2026-03-11T10:05:09.456] 192.168.88.1 is reachable; resetting consecutive failure counter.
[2026-03-11T10:06:09.789] 192.168.88.1 is unreachable (last result: no response). Consecutive fails: 1/3.
```
## Reboot Commands
The scripts automatically select the appropriate reboot command based on the operating system:
- **Linux**: `sudo systemctl reboot` or `sudo reboot`
- **macOS**: `sudo shutdown -r now`
- **Windows**: `shutdown /r /t 0`
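
The per-platform selection above could look like the following sketch. The helper name `reboot_command` and its keyword flags are hypothetical, and preferring `systemctl` over plain `reboot` on Linux is an assumption; the README does not say how the scripts choose between the two. The function only builds the command and does not run it.

```julia
# Hypothetical helper: pick a reboot command for the given platform flags.
# The flags default to the running OS; they are parameters only so that the
# selection can be exercised without rebooting anything.
function reboot_command(; linux::Bool = Sys.islinux(),
                          apple::Bool = Sys.isapple(),
                          windows::Bool = Sys.iswindows())
    if windows
        return `shutdown /r /t 0`
    elseif apple
        return `sudo shutdown -r now`
    elseif linux
        # Assumption: use systemctl when it is on PATH, otherwise plain reboot.
        return Sys.which("systemctl") === nothing ? `sudo reboot` : `sudo systemctl reboot`
    else
        error("unsupported operating system")
    end
end
```

Calling `run(reboot_command())` would actually reboot the machine, so a dry-run check should guard that call.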
## Safety Features
1. **Cooldown Period**: After a reboot, the script waits `COOLDOWN_AFTER_REBOOT_SECS` seconds before performing another check
2. **Consecutive Failures**: Requires multiple consecutive failures before triggering a reboot
3. **Dry Run Mode**: Set `DRY_RUN = true` to test without actually rebooting
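
The three safety features can be combined into a single decision step, sketched below. The function name `decide`, its return symbols, and its keyword names are hypothetical. The sketch also assumes the cooldown is measured from `last_reboot_datetime` and suppresses a new reboot rather than pausing checks, and that the failure counter keeps its value during cooldown; the actual scripts may behave differently.

```julia
using Dates

# Hypothetical decision step: given the previous failure count and the latest
# check result, return (new_fail_count, action), where action is one of
# :none, :cooldown, :dry_run, or :reboot.
function decide(fails::Int, check_ok::Bool,
                last_reboot::Union{DateTime,Nothing}, now_dt::DateTime;
                fails_to_reboot::Int = 3, cooldown_secs::Int = 600,
                dry_run::Bool = false)
    check_ok && return (0, :none)                      # success resets the counter
    fails += 1
    fails < fails_to_reboot && return (fails, :none)   # below the threshold
    if last_reboot !== nothing &&
       now_dt - last_reboot < Millisecond(1000 * cooldown_secs)
        return (fails, :cooldown)                      # too soon after the last reboot
    end
    return (fails, dry_run ? :dry_run : :reboot)
end
```

Keeping the decision free of side effects makes it easy to test each safety rule without touching the network or the reboot command.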