What is the retry frequency, and can it be customized?

I am monitoring a lot of PDFs, and roughly 10% of them throw a parsing error when they are checked. I’m using cloud monitoring, and I’ve noticed that the monitors seem to automatically retry 20 minutes after failing (at which point they usually succeed).

Is this a specific setting? Is there a way to customize this?

Thanks,
Micah

are the errors transient or permanent? one of the most common causes of parsing errors is the site serving an error page instead of the requested pdf file.

the retries are automatic and not customizable. what did you need to accomplish?

Thanks for getting back to me

Some info about my current setup:
I have a list of PDF URLs that I’m checking once per day. As I mentioned, I’m experiencing an average failure rate of 10% per day (on the first check). Looking at all the errors from the last four days, only 25% of the PDF monitors that failed did so more than once (on the first check). So although there are some repeat offenders, most of the ones that failed in the last four days only failed once, and it seems pretty random.

The failure rate on retries seems to be around the same: of the 10% that fail the first check, only around 10% fail on the retry, though my sample size isn’t big enough to be sure. But retries usually work, is my point.
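
As a back-of-the-envelope check on those numbers (assuming first-check and retry failures are independent, which I haven’t verified, and using a made-up count of 200 monitors):

```python
# Rough estimate only; assumes first-check and retry failures are independent.
first_check_failure_rate = 0.10  # ~10% of monitors fail their first check
retry_failure_rate = 0.10        # ~10% of those also fail the retry

# Share of all monitors expected to fail both the first check and the retry.
both_fail_rate = first_check_failure_rate * retry_failure_rate


def expected_double_failures(monitor_count: int) -> float:
    """Expected number of monitors per day that fail the check and the retry."""
    return monitor_count * both_fail_rate


if __name__ == "__main__":
    # 200 is a hypothetical monitor count, used only for illustration.
    print(f"{both_fail_rate:.1%} of monitors, "
          f"about {expected_double_failures(200):.0f} out of 200")
```

So only about 1% of monitors would be expected to fail both the first check and the retry, which is why being notified only after a failed retry would cut the noise so much.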

95% of the failures were E_PDF_PARSE, and the rest were EREQUEST.

To be clear, it doesn’t seem to be an ongoing issue, just that a random selection of around 10% of my pdf monitors will fail their initial check every day.

I am not really concerned about the fact of the failures, just presenting all this info if you think it might be an error on Distill’s end.


what did you need to accomplish?

Since I am getting so many errors, I am wary of alert fatigue. I have a few suggestions for potential improvements to the error handling:

  1. The ability to include the same action types as for monitor alerts. Right now you can only do a generic web hook, so it can’t be directly used with slack/discord/teams.
  2. A “sample rate” for error messages. Right now I have the error settings turned all the way down (2 minute delay between error messages, and alert after 1 message). As such, every day I get a notice saying 1 monitor had an error, and then 2 minutes later I get a notice saying the other ~10% had an error. If I had a 2 minute sample rate, then it would wait 2 minutes after the first error, collect all errors in that time, and send them all together in one message.
  3. Customize when error notices happen in relation to retries. As I said, I am okay dealing with the error messages, but I would prefer to be notified only if the retries also encounter an error. If a ~10% initial failure rate is unavoidable, then I’d prefer to only know the list of monitors that failed on their first retry.

Alternatively, can you provide suggestions for how I can improve my setup, or is there some feature I don’t know about that would help me capture any changes OR errors to the sites/pdfs I am monitoring?

Thank you for your time!

thanks for the response.

intermittent errors in cloud are expected because websites can block requests

makes sense. we plan to support error actions for other channels too in the future.

this setting will send a notification as soon as an error occurs. is that intentional?

  1. “No. of consecutive errors to trigger notifications” will trigger an error action only after those many consecutive checks errored for a specific monitor.
  2. “Minimum time interval between notifications (in minutes)” is like a buffer. distill will batch all errored-out monitors within this interval, aggregate them, and trigger actions at the end of the interval. the higher the value, the lower the number of alerts.
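
as a rough illustration of how the first setting behaves per monitor (this is only a sketch, not distill's actual code; the class and method names are made up, and it assumes a notification fires once, when the streak first reaches the threshold):

```python
from collections import defaultdict


class ConsecutiveErrorTracker:
    """Per-monitor count of consecutive failed checks (illustrative only)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # monitor id -> number of failed checks in a row
        self.streaks = defaultdict(int)

    def record(self, monitor_id: str, ok: bool) -> bool:
        """Record one check result; return True if a notification is due."""
        if ok:
            # Any successful check (including a retry) resets the streak.
            self.streaks[monitor_id] = 0
            return False
        self.streaks[monitor_id] += 1
        # Assumption: notify exactly once, when the threshold is first reached.
        return self.streaks[monitor_id] == self.threshold


if __name__ == "__main__":
    tracker = ConsecutiveErrorTracker(threshold=2)
    print(tracker.record("pdf-1", ok=False))  # first failure: below threshold
    print(tracker.record("pdf-1", ok=True))   # retry succeeded: streak resets
```

with a threshold of 2, a pdf that fails its first check but passes the retry never triggers a notification, and each monitor is counted separately.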

an exception is “webhook call” action. webhook actions are not batched. our hypothesis is that raw webhooks will be used for programmatic usage and may need to respond to the errors as soon as they are encountered.

i will give it some more thought. is it related to the “minimum time interval…” setting?

My apologies, I misunderstood the “No. of consecutive errors to trigger notifications” setting. I was assuming that was for all monitors, but it makes much more sense that it would be on a per-monitor basis. That likely will help me cover most of my use cases, as I wouldn’t want to be notified unless a PDF fails a retry or two. However, there may be cases where one might want to be immediately notified for a website monitor error (particularly with a macro), and setting “No. of consecutive errors to trigger notifications” to 2 or 3 would cause a delay in being notified about a failing website macro.

As to the “sample rate”, that is becoming more desirable with my current setup. To recap, I am monitoring a large number of PDFs, and they’re all set to be checked at the same time every day. Even though they all trigger at the same time, as soon as a single monitor errors (as my current threshold is set to 1), it sends a notification for that single monitor. Only when the error notification interval passes do all the other error messages come through, despite all the monitors being scheduled at the same time.

Adding a delay between the error threshold being reached and the error message being sent would ensure that monitors scheduled for the same time have their error messages bundled into the same alert. It’s tricky to say what a good delay would be, as you obviously don’t want it to be long enough to hold up important notifications, but perhaps a customizable setting would be beneficial. I would probably use 1 minute, as that’s the minimum cron interval.

You can see this behavior if you set up a couple monitors that will instantly fail, and trigger them all at once with an error threshold of 1. Only the first monitor will be included in the error message, and then all the rest will come after the error notification interval has passed. It’s not a huge deal, but it causes twice as many error messages for monitors that run at the same time.

Thanks for your time,
Micah

Here’s an example of what our notification feed normally looks like for the PDFs:

One notification arrives when the first monitor errors, and then {Minimum time interval between notifications} minutes later, the rest of the failing monitors are sent together. After the retry delay, the same pattern repeats: one notification, and then all the rest in a group.

in most cases that we see, the tolerance is quite high, especially when the check interval is high. this is the reason the default value is set to 5. it removes the noise due to transient errors that are not critical. setting it to 1 is highly discouraged because of the nature of how sites handle requests.

the “minimum interval” is like a debouncing logic. the first action is taken as soon as the condition is met (the rising edge) and the next action is taken after the interval (on the falling edge). we can offer an option to trigger the first notification on the falling edge too in case it is okay to delay the first error notification.
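
here is a simplified model of the two modes described above, in python. it is a sketch for discussion, not our implementation: the function name, the `leading` flag, and the use of seconds as timestamps are all assumptions.

```python
def plan_notifications(error_times, interval, leading=True):
    """Group error timestamps into (send_time, [error_times]) notifications.

    leading=True  -> the first error in a window is sent immediately (rising
                     edge); later errors are sent together when the window ends.
    leading=False -> every error in the window, including the first, is held
                     and sent together when the window ends (falling edge).
    """
    notifications = []
    window_start = None
    pending = []
    for t in sorted(error_times):
        # Close the current window if this error falls outside it.
        if window_start is not None and t >= window_start + interval:
            if pending:
                notifications.append((window_start + interval, pending))
            window_start, pending = None, []
        if window_start is None:
            window_start = t  # this error opens a new window
            if leading:
                notifications.append((t, [t]))
            else:
                pending.append(t)
        else:
            pending.append(t)
    # Flush whatever is still waiting in the last window.
    if pending:
        notifications.append((window_start + interval, pending))
    return notifications


if __name__ == "__main__":
    errors = [0, 5, 10]  # seconds; three monitors failing almost together
    print(plan_notifications(errors, interval=120, leading=True))
    print(plan_notifications(errors, interval=120, leading=False))
```

with three errors at 0s, 5s and 10s and a 120s interval, the leading mode yields two notifications (one at 0s with the first error, one at 120s with the other two), while the trailing-only mode yields a single notification at 120s containing all three.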

note that the webhook error action is not debounced. it is triggered in real time so that one can build custom logic to handle custom needs.
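
as a starting point for that kind of custom logic, a minimal receiver using only the python standard library could look like the sketch below. it assumes the webhook is delivered as an http POST with a json body; the payload is treated as opaque because its fields are not specified here, and the handler name, the queue, and the port are placeholders.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Received error events wait here for whatever custom logic consumes them
# (batching, retry-aware filtering, forwarding to a chat tool, and so on).
error_events = queue.Queue()


class ErrorWebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON and queues it without assuming any field names."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            payload = json.loads(raw or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        error_events.put(payload)
        self.send_response(204)  # accepted, nothing to return
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=8080):
    """Build (but do not start) a local server; the port is a placeholder."""
    return HTTPServer(("127.0.0.1", port), ErrorWebhookHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

a separate worker can then read from `error_events` and decide when to alert, for example by holding events for a minute and sending one combined message.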