mctp: Add retry for one-time peer property queries on timeout by JunYe1993 · Pull Request #129 · CodeConstruct/mctp

JunYe1993 · 2025-10-30T07:33:26Z

The function query_peer_properties() is called once during peer initialization to query basic information after the EID becomes routable. To improve reliability, this change adds a retry mechanism when the query fails with -ETIMEDOUT. Since these queries are one-time initialization steps, a single successful attempt is sufficient, and retrying enhances stability under transient MCTP bus contention or multi-master timing issues.

Testing:
Add a stress test for peer initialization under multi-master conditions:

while true; do
    echo "Restarting mctpd.service..."
    systemctl restart mctpd.service

    # Wait a few seconds to allow service to initialize
    sleep 20
done

After 30 loops, the mctpd.service journal is checked for the expected
retry messages, to verify robustness under transient MCTP bus contention.

root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
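
To summarise a journal like the one above, a short script can report the highest retry attempt logged by each mctpd instance. This is only a sketch: the regular expression is inferred from the four log lines shown here, and `max_attempt_per_pid` is a hypothetical helper name, not part of mctpd or its test suite.

```python
import re

# Sample lines copied from the journal output above.
JOURNAL = """\
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
"""

# Assumed log shape: "mctpd[<pid>]: mctpd: Retrying to get ... Attempt <n>"
RETRY_RE = re.compile(r"mctpd\[(\d+)\]: mctpd: Retrying to get .* Attempt (\d+)$")


def max_attempt_per_pid(text):
    """Map each mctpd PID to the highest retry attempt it logged."""
    attempts = {}
    for line in text.splitlines():
        match = RETRY_RE.search(line)
        if match:
            pid, attempt = int(match.group(1)), int(match.group(2))
            attempts[pid] = max(attempts.get(pid, 0), attempt)
    return attempts


print(max_attempt_per_pid(JOURNAL))  # {31801: 1, 32065: 2, 32360: 1}
```

In this sample the highest attempt number is 2, which is below the `max_retries = 4` used in the patch.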

jk-ozlabs · 2025-10-31T01:10:57Z

Hi Daniel,

Thanks for the contribution! I think the retries are a good plan there.

I'll put a few comments inline, but as a first edit can you squash the fixes into the original patch? That means we don't have a point in history where tests are failing.

Speaking of tests, can you add some for this behaviour? Let me know if you need a hand doing so.

jk-ozlabs · 2025-10-31T01:13:37Z

src/mctpd.c

+				if (peer->ctx->verbose)
+					warnx("Retrying to get endpoint types for %s. Attempt %d",
+						peer_tostr(peer), i + 1);
+				rc = query_get_peer_msgtypes(peer);


With this change, we now have duplicate calls to query_get_peer_msgtypes(). If you include the first call in the retry loop you can eliminate this.

No need for including the first warnx warning in the timeout case.

Same with query_get_peer_uuid() below.

Change to

	for (unsigned int i = 0; i < max_retries; i++) {
		rc = query_get_peer_msgtypes(peer);

		// Success
		if (rc == 0)
			break;

		// On timeout, retry
		if (rc == -ETIMEDOUT) {
			if (peer->ctx->verbose)
				warnx("Retrying to get endpoint types for %s. Attempt %u",
				      peer_tostr(peer), i + 1);
			continue;
		}

		// On other errors, warn and ignore
		if (rc < 0) {
			if (peer->ctx->verbose)
				warnx("Error getting endpoint types for %s. Ignoring error %d %s",
				      peer_tostr(peer), -rc, strerror(-rc));
			rc = 0;
			break;
		}
	}

Same with query_get_peer_uuid()

jk-ozlabs · 2025-10-31T01:14:56Z

src/mctpd.c

 static int query_peer_properties(struct peer *peer)
 {
 	int rc;
+	const int max_retries = 4;


Super minor: can you reverse-christmas-tree these, so we would have:

	const int max_retries = 4;
	int rc;

May as well make this unsigned int, along with the loop counter.

... and we should make this configurable, but that would be best as a later change. No need to include that in this PR.

JunYe1993 · 2025-10-31T08:48:34Z

> Speaking of tests, can you add some for this behaviour? Let me know if you need a hand doing so.

If we need to build a proper unit test for this behaviour, I'll need some help, since the function is static in a .c file and not easy to isolate for unit testing.

I did add more details in the PR description showing how I tested this manually in a multi-master environment (restarting mctpd.service repeatedly and checking the journal for retries).

jk-ozlabs · 2025-11-03T08:38:03Z

The test cases for mctpd are in tests/test_mctpd.py, and involve setting up a fake mctp environment (local kernel state, and remote MCTP endpoints) that mctpd can interact with.

I figure we'll need a fake endpoint implementation that drops the first Get Endpoint UUID command, for example.
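
As a rough illustration of that idea, the sketch below models an endpoint that ignores its first Get Endpoint UUID request, together with a Python stand-in for the retry loop under review. It does not use the real test framework: `DroppingEndpoint`, `handle_get_endpoint_uuid` and `query_with_retry` are hypothetical names, and the UUID string is a placeholder.

```python
import errno


class DroppingEndpoint:
    """Hypothetical fake endpoint that ignores its first `drops` requests."""

    def __init__(self, drops, uuid):
        self.drops = drops
        self.uuid = uuid
        self.requests_seen = 0

    def handle_get_endpoint_uuid(self):
        """Return the UUID, or None to model a dropped (unanswered) request."""
        self.requests_seen += 1
        if self.requests_seen <= self.drops:
            return None
        return self.uuid


def query_with_retry(endpoint, max_retries=4):
    """Stand-in for the C loop: make up to max_retries attempts in total."""
    for _ in range(max_retries):
        response = endpoint.handle_get_endpoint_uuid()
        if response is not None:
            return 0, response
        # No response models -ETIMEDOUT, so try again.
    return -errno.ETIMEDOUT, None


endpoint = DroppingEndpoint(drops=1, uuid="placeholder-uuid")
rc, uuid = query_with_retry(endpoint)
print(rc, uuid, endpoint.requests_seen)
```

A test built this way can assert on the number of requests the endpoint saw and on the value that finally came back, without reading any log output.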

jk-ozlabs · 2025-12-03T07:45:12Z

Is there anything else you need from me to progress those test cases?

JunYe1993 · 2025-12-03T09:21:29Z

> Is there anything else you need from me to progress those test cases?

Sorry for the late reply. Clearly, I could use some help. Could you take a look at the latest PR commit for me? I have no idea where the code I wrote went wrong.

jk-ozlabs · 2025-12-04T02:50:40Z

> Sorry for the late reply. Clearly, I could use some help. Could you take a look at the latest PR commit for me? I have no idea where the code I wrote went wrong.

No problem on the timing, I was just checking to see if you were stuck on something! Happy to take a look.

In case you're not doing so, you can run the test case individually. I tend to do this:

$ pytest ../tests/test_mctpd.py -k test_query_peer_properties_retry_timeout

(from within the build directory, with pytest being from the venv that has the packages from requirements.txt installed)

This gives quite a lot of output on the failure, but the important bit is this:

    | Traceback (most recent call last):
    |   File "/home/jk/devel/mctp/mctp/venv/lib/python3.11/site-packages/pytest_trio/plugin.py", line 195, in _fixture_manager
    |     yield nursery_fixture
    |   File "/home/jk/devel/mctp/mctp/venv/lib/python3.11/site-packages/pytest_trio/plugin.py", line 250, in run
    |     await self._func(**resolved_kwargs)
    |   File "/home/jk/devel/mctp/mctp/tests/test_mctpd.py", line 1330, in test_query_peer_properties_retry_timeout
    |     async for line in mctpd.proc.stdout:
    | TypeError: 'async for' requires an object with __aiter__ method, got NoneType

indicating that this is the issue:

    async for line in mctpd.proc.stdout:
        logs.append(line.decode())

In the test framework, we are not capturing stdout of the mctpd process, so we cannot access proc.stdout.

We could do so, but I think an easier approach would be to manually verify once that the retry log is present, and then remove that check.
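
The `NoneType` in that traceback follows the usual subprocess rule: a child's `stdout` attribute is only a stream when the process was started with a pipe. The sketch below shows this with the standard library's `subprocess.Popen`. It is an assumption that the test framework's process object behaves the same way, and the child command here is made up for illustration.

```python
import subprocess
import sys

# A stand-in child process that writes one line to stdout.
CHILD = [sys.executable, "-c", "print('Retrying to get endpoint types')"]

# Default: the child inherits the parent's stdout, so nothing is captured
# and the Popen object's stdout attribute is None.
inherited = subprocess.Popen(CHILD)
inherited.wait()

# With stdout=PIPE, the parent gets a readable stream instead.
piped = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
output, _ = piped.communicate()

print(inherited.stdout)         # None
print(output.decode().strip())  # the line written by the child
```

Iterating over `inherited.stdout` would fail in the same way as the test did, because there is no stream object to iterate.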

JunYe1993 · 2025-12-04T13:48:09Z

I think I’ve broken the commit chain. Could you help me?
Even after doing a rebase and pushing with --force, I still can’t fix it.

My test case needs to capture stderr from mctpd, because I want to verify the retry-on-timeout behavior. To do that, I modified MctpdWrapper in tests/mctpenv/__init__.py so that it no longer prints stdout and stderr when the test fails.

Would that be acceptable to you?

JunYe1993 · 2025-12-05T07:20:54Z

> I think I’ve broken the commit chain. Could you help me? Even after doing a rebase and pushing with --force, I still can’t fix it.

Done.

jk-ozlabs · 2025-12-05T08:57:31Z

> My test case needs to capture stderr from mctpd, because I want to verify the retry-on-timeout behavior. To do that, I modified MctpdWrapper in tests/mctpenv/__init__.py so that it no longer prints stdout and stderr when the test fails.
>
> Would that be acceptable to you?

No, we really need the mctpd output for debugging any failures.

I would suggest that you shouldn't need to assert specific behaviours in the logs, but rather the behaviour over the actual MCTP interfaces (i.e., dbus and MCTP messaging).

The function `query_peer_properties()` is called once during peer initialization to query basic information after the EID becomes routable. To improve reliability, this change adds a retry mechanism when the query fails with `-ETIMEDOUT`. Since these queries are one-time initialization steps, a single successful attempt is sufficient, and retrying enhances stability under transient MCTP bus contention or multi-master timing issues.

Testing:
Add a stress test for peer initialization under multi-master conditions:

```
while true; do
    echo "Restarting mctpd.service..."
    systemctl restart mctpd.service

    # Wait a few seconds to allow service to initialize
    sleep 20
done
```

After 30 loops, the mctpd.service journal is checked for the expected retry messages, to verify robustness under transient MCTP bus contention.

```
root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
```

Signed-off-by: Daniel Hsu <Daniel-Hsu@quantatw.com>

JunYe1993 · 2025-12-10T09:05:33Z

Updated the test so that the endpoint keeps increasing its timeout count, and so that it verifies whether the supported types on the object generated by mctpd match expectations.

jk-ozlabs · 2026-01-05T03:08:27Z

Looks good, thank you! I'll get this merged shortly.

jk-ozlabs · 2026-01-05T04:45:58Z

Merged, plus a few follow-up changes. Thanks for the contribution!

JunYe1993 force-pushed the mctp_add_retry_ branch from 25e6a14 to a7f8281 Compare October 30, 2025 08:52

jk-ozlabs requested changes Oct 31, 2025

JunYe1993 force-pushed the mctp_add_retry_ branch 2 times, most recently from 2de7b87 to 18ab1b5 Compare October 31, 2025 06:31

JunYe1993 force-pushed the mctp_add_retry_ branch from 18ab1b5 to 58649c4 Compare December 3, 2025 09:01

JunYe1993 force-pushed the mctp_add_retry_ branch 4 times, most recently from 0eef8c0 to 8cec7aa Compare December 4, 2025 13:38

JunYe1993 force-pushed the mctp_add_retry_ branch 3 times, most recently from 507a54c to 2f9bf92 Compare December 5, 2025 07:19

JunYe1993 force-pushed the mctp_add_retry_ branch from 2f9bf92 to a6bfe82 Compare December 10, 2025 08:59

JunYe1993 requested a review from jk-ozlabs December 17, 2025 01:48

jk-ozlabs merged commit a6bfe82 into CodeConstruct:main Jan 5, 2026
3 checks passed
