mctp: Add retry for one-time peer property queries on timeout#129
mctp: Add retry for one-time peer property queries on timeout#129jk-ozlabs merged 1 commit intoCodeConstruct:mainfrom
Conversation
25e6a14 to
a7f8281
Compare
|
Hi Daniel, Thanks for the contribution! I think the retries are a good plan there. I'll put a few comments inline, but as a first edit can you squash the fixes into the original patch? That means we don't have a point in history where tests are failing. Speaking of tests, can you add some for this behaviour? let me know if you need a hand doing so. |
src/mctpd.c
Outdated
| if (peer->ctx->verbose) | ||
| warnx("Retrying to get endpoint types for %s. Attempt %d", | ||
| peer_tostr(peer), i + 1); | ||
| rc = query_get_peer_msgtypes(peer); |
There was a problem hiding this comment.
With this change, we now have duplicate calls to query_get_peer_msgtypes(). If you include the first call in the retry loop you can eliminate this.
No need for including the first warnx warning in the timeout case.
Same with query_get_peer_uuid() below.
There was a problem hiding this comment.
Change to
for (unsigned int i = 0; i < max_retries; i++) {
rc = query_get_peer_msgtypes(peer);
// Success
if (rc == 0)
break;
// On timeout, retry
if (rc == -ETIMEDOUT) {
if (peer->ctx->verbose)
warnx("Retrying to get endpoint types for %s. Attempt %u",
peer_tostr(peer), i + 1);
continue;
}
// On other errors, warn and ignore
if (rc < 0) {
if (peer->ctx->verbose)
warnx("Error getting endpoint types for %s. Ignoring error %d %s",
peer_tostr(peer), -rc, strerror(-rc));
rc = 0;
break;
}
}Same with query_get_peer_uuid()
src/mctpd.c
Outdated
| static int query_peer_properties(struct peer *peer) | ||
| { | ||
| int rc; | ||
| const int max_retries = 4; |
There was a problem hiding this comment.
Super minor: can you reverse-christmas-tree these, so we would have:
const int max_retries = 4;
int rc;may as well make this unsigned int too, as well as the loop counter.
There was a problem hiding this comment.
... and we should make this configurable, but that would be best as a later change. No need to include that in this PR.
2de7b87 to
18ab1b5
Compare
If we need to build a proper unit test for this behaviour, I’ll need some help — since the function is static in a .c file and not easy to isolate for unit testing. I did add more details in the PR description showing how I tested this manually in a multi-master environment (restarting mctpd.service repeatedly and checking the journal for retries). |
|
The test cases for mctpd are in I figure we'll need a fake endpoint implementation that drops the first Get Endpoint UUID command, for example. |
|
Is there anything else you need from me to progress those test cases? |
18ab1b5 to
58649c4
Compare
Sorry for the late reply. Clearly, I could use some help. Could you take a look at the latest PR commit for me ? I have no idea what code I wrote and where went wrong. |
No problem on the timing, I was just checking to see if you were stuck on something! Happy to take a look. In case you're not doing so, you can run the test case individually. I tend to do this: $ pytest ../tests/test_mctpd.py -k test_query_peer_properties_retry_timeout(from within the build directory, with This gives quite a lot of output on the failure, but the important bit is this: | Traceback (most recent call last):
| File "/home/jk/devel/mctp/mctp/venv/lib/python3.11/site-packages/pytest_trio/plugin.py", line 195, in _fixture_manager
| yield nursery_fixture
| File "/home/jk/devel/mctp/mctp/venv/lib/python3.11/site-packages/pytest_trio/plugin.py", line 250, in run
| await self._func(**resolved_kwargs)
| File "/home/jk/devel/mctp/mctp/tests/test_mctpd.py", line 1330, in test_query_peer_properties_retry_timeout
| async for line in mctpd.proc.stdout:
| TypeError: 'async for' requires an object with __aiter__ method, got NoneTypeindicating that this is the issue: async for line in mctpd.proc.stdout:
logs.append(line.decode())In the test framework, we are not capturing stdout of the mctpd process, so we cannot access We could do, but I think an easier approach would just be to manually verify that the retry log is present once, and removing that check. |
0eef8c0 to
8cec7aa
Compare
|
I think I’ve broken the commit chain. Could you help me? My test case needs to capture stderr from mctpd, because I want to verify the retry-on-timeout behavior. To do that, I modified MctpdWrapper in tests/mctpenv/init.py so that it no longer prints stdout and stderr when the test fails. Would that be acceptable to you? |
507a54c to
2f9bf92
Compare
done. |
No, we really need the mctpd output for debugging any failures. I would suggest that you shouldn't need to assert specific behaviours in the logs, more the behaviour over the actual MCTP interfaces (ie., dbus and MCTP messaging) |
The function `query_peer_properties()` is called once during peer initialization
to query basic information after the EID becomes routable.
To improve reliability, this change adds a retry mechanism when the query
fails with `-ETIMEDOUT`. Since these queries are one-time initialization steps,
a single successful attempt is sufficient, and retrying enhances stability
under transient MCTP bus contention or multi-master timing issues.
Testing:
add stress test for peer initialization under multi-master
```
while true; do
echo "Restarting mctpd.service..."
systemctl restart mctpd.service
# Wait a few seconds to allow service to initialize
sleep 20
done
```
After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.
```
root@bmc:~# journalctl -xeu mctpd.service | grep Retrying
Oct 29 00:35:21 bmc mctpd[31801]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:00 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
Oct 29 00:39:01 bmc mctpd[32065]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 2
Oct 29 00:45:08 bmc mctpd[32360]: mctpd: Retrying to get endpoint types for peer eid 10 net 1 phys physaddr if 4 hw len 1 0x20 state 1. Attempt 1
```
Signed-off-by: Daniel Hsu <[email protected]>
2f9bf92 to
a6bfe82
Compare
|
Update it so that the endpoint continuously increases the timeout count, and verify whether the object support types generated by mctpd match expectations. |
|
Looks good, thank you! I'll get this merged shortly. |
|
Merged, plus a few follow-up changes. Thanks for the contribution! |
The function
query_peer_properties()is called once during peer initialization to query basic information after the EID becomes routable. To improve reliability, this change adds a retry mechanism when the query fails with-ETIMEDOUT. Since these queries are one-time initialization steps, a single successful attempt is sufficient, and retrying enhances stability under transient MCTP bus contention or multi-master timing issues.Testing:
add stress test for peer initialization under multi-master
After the 30 loops, the script checks mctpd.service journal for expected
retry messages to verify robustness under transient MCTP bus contention.