@gareth-allan
Contributor

@gareth-allan gareth-allan commented Feb 4, 2026

Description

This PR automates refreshing the AWS Glue table metadata for the Digital Letters event record data stored in S3. The infrastructure to record events to S3 and to set up the Glue table was added in #190.

This is achieved by adding a Step Function that runs the MSCK REPAIR TABLE event_record command (docs) against the DL environment's reporting database. As the event_record table is already configured to point at the correct S3 location, running the command refreshes the metadata and detects any new or updated partitions.
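For readers unfamiliar with the pattern, a state machine of this kind could look roughly like the Terraform sketch below. It is not the code in this PR: the resource name, the three variables, and the single-state layout are all assumptions made for illustration. It also assumes the Athena workgroup already has a query result location configured.

```hcl
# Illustrative sketch only. Resource and variable names are hypothetical.
variable "reporting_database" { type = string } # Glue database that holds event_record
variable "athena_workgroup" { type = string }   # workgroup with a result location configured
variable "sfn_role_arn" { type = string }       # role permitted to run Athena queries and update Glue partitions

resource "aws_sfn_state_machine" "metadata_refresh" {
  name     = "metadata-refresh"
  role_arn = var.sfn_role_arn

  definition = jsonencode({
    Comment = "Refresh partition metadata for the event_record table"
    StartAt = "RepairTable"
    States = {
      RepairTable = {
        Type = "Task"
        # The .sync suffix makes the state wait until the Athena query completes.
        Resource = "arn:aws:states:::athena:startQueryExecution.sync"
        Parameters = {
          QueryString = "MSCK REPAIR TABLE event_record"
          WorkGroup   = var.athena_workgroup
          QueryExecutionContext = {
            Database = var.reporting_database
          }
        }
        End = true
      }
    }
  })
}
```

Using the synchronous Athena integration means a failed or timed-out query surfaces as a failed or timed-out execution, which is what execution-level alarms rely on.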

image (1)

Changes:

  • Added a Step Function State Machine, "metadata refresh", which runs the MSCK REPAIR TABLE command against the event_record table. This is based on the housekeeping function in the reporting domain.
  • Added an EventBridge Scheduler to trigger the metadata refresh function hourly (at 10 minutes past the hour, between 6am and 10pm). This is the approach used for the housekeeping step function in the reporting domain. The timings match those of the lambda that runs the MSCK REPAIR TABLE command in core.
  • Added CloudWatch alarms to alert when the metadata refresh function is aborted, fails, or times out. This matches the alerting that is in place for the housekeeping step function in the reporting domain.
  • Added the reasonCode and reasonText columns to the event_record table and ensured the report-event-transformation lambda maps them to the flattened object it produces.
  • Added conditions to all SQS IAM policies that allowed EventBridge to send events to the queue, so that they only allow events from the expected rule(s).
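
The schedule, alarm, and queue-policy changes listed above can be sketched in Terraform as follows. This is a hedged illustration rather than the PR's actual code: every variable and resource name is hypothetical, the cron expression assumes the 6am to 10pm window is inclusive at both ends, and the alarm settings other than period, statistic, and threshold are assumptions.

```hcl
# Illustrative sketch only. All names below are hypothetical.
variable "state_machine_arn" { type = string }
variable "scheduler_role_arn" { type = string } # role the scheduler assumes to start executions
variable "queue_arn" { type = string }
variable "rule_arn" { type = string } # the one EventBridge rule expected to target the queue

# Hourly trigger at 10 past, 06:10 to 22:10 (assuming an inclusive window).
resource "aws_scheduler_schedule" "metadata_refresh" {
  name                = "metadata-refresh"
  schedule_expression = "cron(10 6-22 * * ? *)"

  flexible_time_window {
    mode = "OFF"
  }

  target {
    arn      = var.state_machine_arn
    role_arn = var.scheduler_role_arn
  }
}

# One alarm per terminal status, so each description names its own metric.
locals {
  alarm_metrics = {
    "aborted"   = "ExecutionsAborted"
    "failed"    = "ExecutionsFailed"
    "timed-out" = "ExecutionsTimedOut"
  }
}

resource "aws_cloudwatch_metric_alarm" "metadata_refresh" {
  for_each = local.alarm_metrics

  alarm_name          = "metadata-refresh-${each.key}"
  alarm_description   = "This metric monitors ${each.key} step function executions"
  namespace           = "AWS/States"
  metric_name         = each.value
  dimensions          = { StateMachineArn = var.state_machine_arn }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}

# Queue policy that accepts messages from EventBridge only when they come from the expected rule.
data "aws_iam_policy_document" "queue" {
  statement {
    effect    = "Allow"
    actions   = ["sqs:SendMessage"]
    resources = [var.queue_arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }

    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [var.rule_arn]
    }
  }
}
```

Generating the alarms from a single map is one way to keep each description in step with its metric; separate resources per status work equally well.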

Context

This is required as the solution implemented in #190 did not automatically import the events recorded in S3 into the Glue table, so Athena queries would not return any data unless the MSCK REPAIR TABLE command was run manually. Automating this refresh means that new events will become available to Athena queries on a regular basis.

Validation

Verifying that Events are (Eventually) Visible in Athena Without Manual Intervention

Sent a uk.nhs.notify.digital.letters.print.pdf.analysed.v1 event to the event bus:
Screenshot 2026-02-05 at 14 36 38

Queried Athena:
Screenshot 2026-02-05 at 14 52 21

Waited until the next run of the step function and then queried Athena:
Screenshot 2026-02-05 at 15 34 04

Verifying the New Columns are Recorded as Expected

Sent a uk.nhs.notify.digital.letters.print.letter.transitioned.v1 event to the event bus:
Screenshot 2026-02-05 at 15 51 03

Waited until the next run of the step function and then queried Athena:
Screenshot 2026-02-05 at 16 06 54

Type of changes

  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project
  • I have added tests to cover my changes
  • I have updated the documentation accordingly
  • This PR is a result of pair or mob programming
  • If I have used the 'skip-trivy-package' label I have done so responsibly and in the knowledge that this is being fixed as part of a separate ticket/PR.

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

@gareth-allan gareth-allan force-pushed the feature/CCM-13295_reporting_data_ingestion branch from 552dcf1 to 5a49fc8 Compare February 5, 2026 12:17
@gareth-allan gareth-allan marked this pull request as ready for review February 5, 2026 14:26
@gareth-allan gareth-allan requested review from a team as code owners February 5, 2026 14:26
period = 60
statistic = "Sum"
threshold = 1
alarm_description = "This metric monitors failed step function executions"
Contributor

Suggested change
- alarm_description = "This metric monitors failed step function executions"
+ alarm_description = "This metric monitors aborted step function executions"

simonlabarere previously approved these changes Feb 6, 2026
simonlabarere previously approved these changes Feb 9, 2026
aidenvaines-cgi previously approved these changes Feb 9, 2026
tdroza-nhs previously approved these changes Feb 9, 2026