
Troubleshooting Graylog Problems

Posted on: July 10, 2022 · 2 min read

Scenario: The journal queue keeps growing, no new messages show up in search, and ElasticSearch is logging flood-stage warnings

What’s happening

Graylog has a journal — a queue where incoming messages wait to be processed before being written to ElasticSearch. Messages are kept for a maximum of 12 hours or 5 GB, whichever comes first. The threshold values can be changed in graylog.conf:

message_journal_max_age = 12h
message_journal_max_size = 5gb
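
The journal lives on disk as a directory of segment files, so a quick size check shows how much of that budget is currently in use. The path below is the package-install default set by message_journal_dir; adjust it if yours differs:

# Current on-disk size of the Graylog journal (default package-install path)
du -sh /var/lib/graylog-server/journal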

The flood warnings are ElasticSearch’s doing. ES monitors disk usage and enforces three watermarks:

Watermark      Default   Effect
low            85%       ES stops allocating new shards to the node
high           90%       ES starts relocating shards away from the node
flood stage    95%       ES marks all indices read-only

When the flood stage is hit, ES sets index.blocks.read_only_allow_delete = true on every index that has a shard on the affected node. Graylog can no longer write messages, so they pile up in the journal. The journal grows. Eventually it hits its own size or age limit and starts dropping messages.
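
To see how close each node actually is to those watermarks, the _cat allocation API reports per-node disk usage. This assumes ElasticSearch is reachable on localhost:9200, like the other commands in this post:

# Per-node disk usage, to compare against the 85/90/95% watermarks
curl -X GET "http://localhost:9200/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total"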

Fix

Step 1. Free up disk space on the ElasticSearch nodes — delete old indices, remove unnecessary files, or expand storage.

To see which indices exist and how large they are:

curl -X GET "http://localhost:9200/_cat/indices?v&s=store.size:desc"

To delete an old index:

curl -X DELETE "http://localhost:9200/<index-name>"
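
After deleting, a quick df on the ElasticSearch data partition (the path below assumes a default package install) confirms the node has dropped back under the watermarks:

# Free space on the ES data partition
df -h /var/lib/elasticsearch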

Step 2. Once there’s enough free space, remove the read-only block from all indices. Older ElasticSearch versions (before 7.4) never lift this block on their own; newer versions clear it automatically once disk usage drops below the high watermark, but removing it manually is harmless and immediate.

curl -X PUT "http://localhost:9200/_all/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.blocks.read_only_allow_delete": null}'
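
To confirm the block is really gone, read the setting back; once cleared, the response should show empty settings for each index. This is just a sanity check against the same localhost node:

# Should return empty settings objects once the block has been cleared
curl -X GET "http://localhost:9200/_all/_settings/index.blocks.read_only_allow_delete?pretty"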

Step 3. Verify Graylog resumes writing. The journal size should start dropping as the backlog drains. Check the journal state in System → Overview in the Graylog web interface.
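
If you prefer the command line, the Graylog REST API exposes the same journal information as the Overview page. This is a sketch that assumes the API listens on localhost:9000 and that admin:yourpassword is replaced with real credentials:

# Journal state as JSON; uncommitted entries should trend toward zero
curl -u admin:yourpassword "http://localhost:9000/api/system/journal?pretty=true"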

If Graylog is still stuck after lifting the read-only block, restart the Graylog service to force it to re-establish its ES connection:

sudo systemctl restart graylog-server
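
After the restart, it is worth checking the service state and tailing the server log for a minute to confirm the ES connection comes back and the journal starts draining (log path assumes a default package install):

# Confirm the service is up, then watch for ES connection and journal messages
sudo systemctl status graylog-server
sudo tail -f /var/log/graylog-server/server.log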