-
Notifications
You must be signed in to change notification settings - Fork 8
Implement async DNS resolution for epoll backend #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add working resolver implementation for Linux using a worker thread approach. Since POSIX getaddrinfo() is blocking and cannot be monitored with epoll, each resolution spawns a dedicated thread that runs the blocking call and posts completion back to the scheduler. Implementation details: - epoll_resolve_op inherits from scheduler_op (not epoll_op) since DNS resolution doesn't use epoll file descriptors - Worker threads are detached and self-managing; shared_ptr capture keeps the impl alive until completion - Cancellation uses an atomic flag checked after getaddrinfo() returns (the call itself cannot be interrupted) - Follows the same shared_ptr lifetime pattern as sockets.hpp for safe cleanup during shutdown Files changed: - src/corosio/src/detail/epoll/resolver_service.hpp: Major rewrite with epoll_resolve_op structure, updated epoll_resolver_impl with enable_shared_from_this, and epoll_resolver_service with scheduler integration - src/corosio/src/detail/epoll/resolver_service.cpp: New implementation file with flags_to_hints(), convert_results(), make_gai_error() helpers and the worker thread logic Also adds comprehensive cross-platform unit tests for the resolver: - Construction and move semantics - Basic resolution (localhost, numeric IPv4/IPv6, service names) - Error handling (invalid hosts, flag violations) - Cancellation behavior - Sequential resolution operations - io_result and structured binding usage - resolve_flags bitwise operators - resolver_results and resolver_entry APIs Tests use localhost and numeric IPs to avoid external DNS dependencies in CI environments.
📝 WalkthroughWalkthroughImplements epoll-based asynchronous DNS resolution by introducing resolver operation structs, service classes, and worker thread coordination. Uses getaddrinfo on worker threads with results posted to the epoll scheduler. Includes comprehensive unit tests covering construction, resolution scenarios, error handling, and cancellation. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant epoll_resolver_impl
participant epoll_resolver_service
participant Worker Thread
participant Scheduler
participant Coroutine
Caller->>epoll_resolver_impl: resolve(host, service, flags, token)
epoll_resolver_impl->>epoll_resolver_service: work_started()
epoll_resolver_impl->>epoll_resolve_op: reset() and configure
epoll_resolver_impl->>epoll_resolve_op: start(stop_token)
epoll_resolve_op->>Worker Thread: spawn detached thread
epoll_resolve_op-->>epoll_resolver_impl: return
Worker Thread->>Worker Thread: build addrinfo hints from flags
Worker Thread->>Worker Thread: call getaddrinfo(host, service)
alt Resolution succeeds
Worker Thread->>Worker Thread: convert_results to resolver_results
else Resolution fails
Worker Thread->>Worker Thread: make_gai_error mapping
end
Worker Thread->>epoll_resolve_op: complete with result/error
epoll_resolve_op->>epoll_resolve_op: apply result to ec_out/out
epoll_resolve_op->>epoll_resolver_service: post(this)
epoll_resolve_op->>epoll_resolver_service: work_finished()
epoll_resolver_service->>Scheduler: enqueue completion operation
Scheduler->>epoll_resolve_op: operator()() invoke
epoll_resolve_op->>Coroutine: resume with results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes The implementation involves intricate async coordination between worker threads, coroutine lifecycle management, cancellation semantics, and scheduler integration. Multiple interacting classes with non-trivial state machines and error handling require careful validation of thread-safety, lifecycle correctness, and getaddrinfo result conversion logic. Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
|
An automated preview of the documentation is available at https://54.corosio.prtest3.cppalliance.org/index.html If more commits are pushed to the pull request, the docs will rebuild at the same URL. 2026-01-22 15:48:17 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/corosio/src/detail/epoll/resolver_service.cpp`:
- Around line 169-184: In epoll_resolve_op::operator() ensure ec_out is
explicitly cleared on success: when ec_out is non-null and the operation is not
cancelled and gai_error == 0, set *ec_out = {}; then continue to move
stored_results into *out; keep checks using cancelled.load(...) and gai_error
as-is so only successful resolution paths clear the error code (refer to ec_out,
cancelled, gai_error, out, stored_results and epoll_resolve_op::operator()).
- Around line 250-285: The thread creation can throw and work_started() must be
balanced and post() must happen before work_finished(); wrap the std::thread
construction (the worker lambda that captures shared_from_this()) in a
try-catch(std::system_error) and in the catch call svc_.post(&op_) then
svc_.work_finished() to ensure the operation completes; also change the order
inside the worker lambda to call svc_.post(&op_) before svc_.work_finished()
after resolving addresses (where you set op_.stored_results via convert_results,
op_.gai_error, freeaddrinfo, and inspect op_.cancelled).
In `@src/corosio/src/detail/epoll/resolver_service.hpp`:
- Around line 145-158: The resolver implementation reuses a single
epoll_resolve_op (op_) and lacks protection against concurrent resolve calls;
update epoll_resolver_impl to enforce the single-inflight contract by adding a
guard (e.g., an atomic<bool> op_in_flight or a simple assert) checked at the
start of resolve() and cleared in cancel() and when the operation completes;
reference op_, resolve(), cancel(), and epoll_resolve_op when implementing so
you set the flag before starting the op and reset it on cancellation/completion
(or replace with assert(!op_in_flight) if you prefer a debug-only check), and
add a short class-level comment stating the single-inflight requirement.
🧹 Nitpick comments (1)
test/unit/resolver.cpp (1)
294-357: Keep invalid-host test hermetic.Even with
.invalid,getaddrinfocan still hit external DNS and be flaky on network-restricted CI. Consider using a deterministic failure that avoids DNS (e.g., numeric-host with an invalid literal) or marking this as an integration test.
| void | ||
| epoll_resolve_op:: | ||
| operator()() | ||
| { | ||
| stop_cb.reset(); // Disconnect stop callback | ||
|
|
||
| if (ec_out) | ||
| { | ||
| if (cancelled.load(std::memory_order_acquire)) | ||
| *ec_out = capy::error::canceled; | ||
| else if (gai_error != 0) | ||
| *ec_out = make_gai_error(gai_error); | ||
| } | ||
|
|
||
| if (out && !cancelled.load(std::memory_order_acquire) && gai_error == 0) | ||
| *out = std::move(stored_results); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure ec_out is cleared on success.
If the caller reuses the same system::error_code, a successful resolve can leave a stale error code behind. Consider explicitly clearing it on success.
✅ Proposed fix
if (ec_out)
{
if (cancelled.load(std::memory_order_acquire))
*ec_out = capy::error::canceled;
else if (gai_error != 0)
*ec_out = make_gai_error(gai_error);
+ else
+ *ec_out = {};
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| void | |
| epoll_resolve_op:: | |
| operator()() | |
| { | |
| stop_cb.reset(); // Disconnect stop callback | |
| if (ec_out) | |
| { | |
| if (cancelled.load(std::memory_order_acquire)) | |
| *ec_out = capy::error::canceled; | |
| else if (gai_error != 0) | |
| *ec_out = make_gai_error(gai_error); | |
| } | |
| if (out && !cancelled.load(std::memory_order_acquire) && gai_error == 0) | |
| *out = std::move(stored_results); | |
| void | |
| epoll_resolve_op:: | |
| operator()() | |
| { | |
| stop_cb.reset(); // Disconnect stop callback | |
| if (ec_out) | |
| { | |
| if (cancelled.load(std::memory_order_acquire)) | |
| *ec_out = capy::error::canceled; | |
| else if (gai_error != 0) | |
| *ec_out = make_gai_error(gai_error); | |
| else | |
| *ec_out = {}; | |
| } | |
| if (out && !cancelled.load(std::memory_order_acquire) && gai_error == 0) | |
| *out = std::move(stored_results); |
🤖 Prompt for AI Agents
In `@src/corosio/src/detail/epoll/resolver_service.cpp` around lines 169 - 184, In
epoll_resolve_op::operator() ensure ec_out is explicitly cleared on success:
when ec_out is non-null and the operation is not cancelled and gai_error == 0,
set *ec_out = {}; then continue to move stored_results into *out; keep checks
using cancelled.load(...) and gai_error as-is so only successful resolution
paths clear the error code (refer to ec_out, cancelled, gai_error, out,
stored_results and epoll_resolve_op::operator()).
| svc_.work_started(); | ||
|
|
||
| auto self = shared_from_this(); | ||
| std::thread worker([this, self = std::move(self)]() { | ||
| struct addrinfo hints{}; | ||
| hints.ai_family = AF_UNSPEC; | ||
| hints.ai_socktype = SOCK_STREAM; | ||
| hints.ai_flags = flags_to_hints(op_.flags); | ||
|
|
||
| struct addrinfo* ai = nullptr; | ||
| int result = ::getaddrinfo( | ||
| op_.host.empty() ? nullptr : op_.host.c_str(), | ||
| op_.service.empty() ? nullptr : op_.service.c_str(), | ||
| &hints, &ai); | ||
|
|
||
| if (!op_.cancelled.load(std::memory_order_acquire)) | ||
| { | ||
| if (result == 0 && ai) | ||
| { | ||
| op_.stored_results = convert_results(ai, op_.host, op_.service); | ||
| op_.gai_error = 0; | ||
| } | ||
| else | ||
| { | ||
| op_.gai_error = result; | ||
| } | ||
| } | ||
|
|
||
| if (ai) | ||
| ::freeaddrinfo(ai); | ||
|
|
||
| svc_.work_finished(); | ||
| svc_.post(&op_); | ||
| }); | ||
| worker.detach(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
find . -name "resolver_service.cpp" -type f 2>/dev/nullRepository: cppalliance/corosio
Length of output: 165
🏁 Script executed:
# Check repository structure
ls -la | head -20
git ls-files | grep -i resolver | head -20Repository: cppalliance/corosio
Length of output: 1542
🏁 Script executed:
wc -l ./src/corosio/src/detail/epoll/resolver_service.cppRepository: cppalliance/corosio
Length of output: 118
🏁 Script executed:
sed -n '240,295p' ./src/corosio/src/detail/epoll/resolver_service.cppRepository: cppalliance/corosio
Length of output: 1412
🏁 Script executed:
rg "work_started|work_finished" ./src/corosio/src/detail/epoll/resolver_service.cpp -B2 -A2Repository: cppalliance/corosio
Length of output: 422
🏁 Script executed:
rg "svc_\s*=" ./src/corosio/src/detail/epoll/resolver_service.cpp -B2 -A2Repository: cppalliance/corosio
Length of output: 45
🏁 Script executed:
cat -n ./src/corosio/src/detail/epoll/resolver_service.hpp | head -100Repository: cppalliance/corosio
Length of output: 3927
🏁 Script executed:
rg "work_started|work_finished" ./src/corosio/src/detail/epoll/resolver_service.hppRepository: cppalliance/corosio
Length of output: 131
🏁 Script executed:
rg "work_started|work_finished" ./include/ -lRepository: cppalliance/corosio
Length of output: 146
🏁 Script executed:
rg "class io_context|class resolver_service" ./include/ -A10Repository: cppalliance/corosio
Length of output: 1457
🏁 Script executed:
wc -l ./src/corosio/src/detail/epoll/resolver_service.cpp
head -50 ./src/corosio/src/detail/epoll/resolver_service.cppRepository: cppalliance/corosio
Length of output: 1508
🏁 Script executed:
sed -n '240,300p' ./src/corosio/src/detail/epoll/resolver_service.cpp | cat -nRepository: cppalliance/corosio
Length of output: 2048
🏁 Script executed:
cat ./src/corosio/src/detail/epoll/resolver_service.cppRepository: cppalliance/corosio
Length of output: 9066
Guard thread-creation failure and reorder post() before work_finished().
std::thread construction can throw std::system_error when system resources are exhausted; if unhandled, work_started() remains unbalanced and the operation never completes. Additionally, the current code calls work_finished() before post(), creating a race where the scheduler may exit after detecting all work is done but before the completion is enqueued.
Wrap the std::thread creation in try-catch, and reorder the calls in the thread lambda so post(&op_) is invoked before work_finished(). The catch handler should call both post() and work_finished() in the same order to maintain consistency.
✅ Proposed fix
svc_.work_started();
auto self = shared_from_this();
- std::thread worker([this, self = std::move(self)]() {
- struct addrinfo hints{};
- hints.ai_family = AF_UNSPEC;
- hints.ai_socktype = SOCK_STREAM;
- hints.ai_flags = flags_to_hints(op_.flags);
-
- struct addrinfo* ai = nullptr;
- int result = ::getaddrinfo(
- op_.host.empty() ? nullptr : op_.host.c_str(),
- op_.service.empty() ? nullptr : op_.service.c_str(),
- &hints, &ai);
-
- if (!op_.cancelled.load(std::memory_order_acquire))
- {
- if (result == 0 && ai)
- {
- op_.stored_results = convert_results(ai, op_.host, op_.service);
- op_.gai_error = 0;
- }
- else
- {
- op_.gai_error = result;
- }
- }
-
- if (ai)
- ::freeaddrinfo(ai);
-
- svc_.work_finished();
- svc_.post(&op_);
- });
- worker.detach();
+ try
+ {
+ std::thread worker([this, self = std::move(self)]() {
+ struct addrinfo hints{};
+ hints.ai_family = AF_UNSPEC;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_flags = flags_to_hints(op_.flags);
+
+ struct addrinfo* ai = nullptr;
+ int result = ::getaddrinfo(
+ op_.host.empty() ? nullptr : op_.host.c_str(),
+ op_.service.empty() ? nullptr : op_.service.c_str(),
+ &hints, &ai);
+
+ if (!op_.cancelled.load(std::memory_order_acquire))
+ {
+ if (result == 0 && ai)
+ {
+ op_.stored_results = convert_results(ai, op_.host, op_.service);
+ op_.gai_error = 0;
+ }
+ else
+ {
+ op_.gai_error = result;
+ }
+ }
+
+ if (ai)
+ ::freeaddrinfo(ai);
+
+ svc_.post(&op_);
+ svc_.work_finished();
+ });
+ worker.detach();
+ }
+ catch (...)
+ {
+ op_.gai_error = EAI_AGAIN;
+ svc_.post(&op_);
+ svc_.work_finished();
+ }🤖 Prompt for AI Agents
In `@src/corosio/src/detail/epoll/resolver_service.cpp` around lines 250 - 285,
The thread creation can throw and work_started() must be balanced and post()
must happen before work_finished(); wrap the std::thread construction (the
worker lambda that captures shared_from_this()) in a
try-catch(std::system_error) and in the catch call svc_.post(&op_) then
svc_.work_finished() to ensure the operation completes; also change the order
inside the worker lambda to call svc_.post(&op_) before svc_.work_finished()
after resolving addresses (where you set op_.stored_results via convert_results,
op_.gai_error, freeaddrinfo, and inspect op_.cancelled).
| void resolve( | ||
| std::coroutine_handle<>, | ||
| capy::executor_ref, | ||
| std::string_view /*host*/, | ||
| std::string_view /*service*/, | ||
| resolve_flags /*flags*/, | ||
| std::string_view host, | ||
| std::string_view service, | ||
| resolve_flags flags, | ||
| std::stop_token, | ||
| system::error_code*, | ||
| resolver_results*) override | ||
| { | ||
| throw std::logic_error("epoll resolver resolve not implemented"); | ||
| } | ||
| resolver_results*) override; | ||
|
|
||
| void cancel() noexcept; | ||
|
|
||
| void cancel() noexcept { /* stub */ } | ||
| epoll_resolve_op op_; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's understand the structure of the resolver implementation
find . -type f -name "*.hpp" -o -name "*.cpp" | grep -i resolver | head -20Repository: cppalliance/corosio
Length of output: 406
🏁 Script executed:
# Look at the file in question and its context
head -200 src/corosio/src/detail/epoll/resolver_service.hppRepository: cppalliance/corosio
Length of output: 5901
🏁 Script executed:
# Search for resolver_impl class definition and usage
rg "class epoll_resolver_impl|class.*resolver_impl" -t cpp -A 10Repository: cppalliance/corosio
Length of output: 3555
🏁 Script executed:
# Look for resolve() method definitions and their usage
rg "void resolve\(" -t cpp -B 2 -A 5Repository: cppalliance/corosio
Length of output: 1807
🏁 Script executed:
# Search for any documentation about resolver concurrency or single-inflight
rg "single.*inflight|in-flight|concurrent.*resolve|resolve.*concurrent|one.*resolve" -t cpp -t hpp -t md -iRepository: cppalliance/corosio
Length of output: 90
🏁 Script executed:
# Look for other resolver backends to see how they handle this
find . -type f \( -name "*resolver*" -o -name "*epoll*" \) | grep -E "\.(hpp|cpp)$" | head -30Repository: cppalliance/corosio
Length of output: 406
🏁 Script executed:
# Look at the epoll resolver implementation
cat src/corosio/src/detail/epoll/resolver_service.cppRepository: cppalliance/corosio
Length of output: 9066
🏁 Script executed:
# Look at the IOCP (Windows) resolver to see how it handles multiple ops
cat src/corosio/src/detail/iocp/resolver_service.hppRepository: cppalliance/corosio
Length of output: 5088
🏁 Script executed:
# Check the public resolver API
cat include/boost/corosio/resolver.hppRepository: cppalliance/corosio
Length of output: 9485
🏁 Script executed:
# Look at test code for usage patterns
grep -n "resolve\|cancel" test/unit/resolver.cpp -B 2 -A 5 | head -100Repository: cppalliance/corosio
Length of output: 2463
Add documentation or assertion for single-inflight resolve contract.
The epoll (and Windows) resolver implementations reuse a single operation object. While the public API documentation requires non-concurrent resolve calls on the same resolver instance, the implementation lacks defensive measures. Add either a clear comment on the epoll_resolver_impl class documenting the single-inflight contract, or implement an assertion/flag in resolve() to catch violations (e.g., assert(!op_in_flight) or use an atomic flag to detect overlapping calls).
🤖 Prompt for AI Agents
In `@src/corosio/src/detail/epoll/resolver_service.hpp` around lines 145 - 158,
The resolver implementation reuses a single epoll_resolve_op (op_) and lacks
protection against concurrent resolve calls; update epoll_resolver_impl to
enforce the single-inflight contract by adding a guard (e.g., an atomic<bool>
op_in_flight or a simple assert) checked at the start of resolve() and cleared
in cancel() and when the operation completes; reference op_, resolve(),
cancel(), and epoll_resolve_op when implementing so you set the flag before
starting the op and reset it on cancellation/completion (or replace with
assert(!op_in_flight) if you prefer a debug-only check), and add a short
class-level comment stating the single-inflight requirement.
Add working resolver implementation for Linux using a worker thread approach. Since POSIX getaddrinfo() is blocking and cannot be monitored with epoll, each resolution spawns a dedicated thread that runs the blocking call and posts completion back to the scheduler.
Implementation details:
Files changed:
Also adds comprehensive cross-platform unit tests for the resolver:
Tests use localhost and numeric IPs to avoid external DNS dependencies in CI environments.
Resolves #53.
Summary by CodeRabbit
Release Notes
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.