fix: Do not share state between different crawlers unless requested #1669

Pijukatel · 2026-01-12T14:28:29Z

Description

Introduces a new argument crawler_id for BasicCrawler. This argument controls the shared state.

Draft for discussion. The main drawback is that this is another unique way of sharing something between different crawlers. Similar, but different existing approaches:

Statistics can be shared by explicitly passing an existing instance to the crawler.
Storages in general can be shared by properly setting a combination of configuration and storage_client arguments to the crawler
Storages can be shared by relying on default values (will be reused by default)
RequestManager can be reused by explicitly passing an existing instance to the request_manager argument

What are the alternatives?

Explicitly passing state_kvs instead of crawler_id, and otherwise using an auto-incremented state counter. This is more aligned with the existing approach of how Statistics can be reused.
Bring the default_kvs_id Configuration value from the SDK level to the Crawlee level. This would allow sharing or not sharing state based on what is in Configuration(default_kvs_id=...) (if using the same storage client...)
Ignore and just document current behavior
...

Issues

Closes: #1627

Testing

TODO

Checklist

CI passed

codecov · 2026-01-12T14:46:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.42%. Comparing base (0a0995c) to head (1bbb651).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1669      +/-   ##
==========================================
+ Coverage   92.41%   92.42%   +0.01%     
==========================================
  Files         157      157              
  Lines       10478    10494      +16     
==========================================
+ Hits         9683     9699      +16     
  Misses        795      795

Flag	Coverage Δ
unit	`92.42% <100.00%> (+0.01%)`	⬆️


Pijukatel · 2026-01-12T14:49:44Z

It would be better to discuss this first. After implementing this draft, I am leaning towards alternative 1 (see description).

@janbuchar, @barjin, @B4nan, could you please share your point of view?

You can see the usage in code in the updated and new test in this PR.

barjin · 2026-01-12T15:08:00Z

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).
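To make the concern above concrete, here is a minimal toy model of how unintended sharing can arise. It is not Crawlee's implementation: `ToyKeyValueStore`, `SharedDefaultCrawler`, `get_or_create`, and the key `CRAWLER_STATE` are hypothetical stand-ins. The sketch assumes that every crawler falls back to the same default store and writes its state under the same fixed key.

```python
import asyncio
from typing import Any


class ToyKeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, Any]] = {}

    async def get_or_create(self, key: str, default: dict[str, Any]) -> dict[str, Any]:
        # Return the stored value, creating it from a copy of the default on first access.
        return self._data.setdefault(key, dict(default))


# Both crawlers end up here because neither passes a store explicitly.
DEFAULT_KVS = ToyKeyValueStore()


class SharedDefaultCrawler:
    # Assumption for this sketch: one fixed key, identical for every instance.
    STATE_KEY = "CRAWLER_STATE"

    def __init__(self, kvs: ToyKeyValueStore = DEFAULT_KVS) -> None:
        self._kvs = kvs

    async def use_state(self, default: dict[str, Any]) -> dict[str, Any]:
        return await self._kvs.get_or_create(self.STATE_KEY, default)


async def demo() -> tuple[dict[str, Any], dict[str, Any]]:
    first, second = SharedDefaultCrawler(), SharedDefaultCrawler()
    state_a = await first.use_state({"pages": 0})
    state_a["pages"] += 1
    # The second crawler asks for a fresh default but receives the first one's state.
    state_b = await second.use_state({"pages": 0})
    return state_a, state_b


state_a, state_b = asyncio.run(demo())
```

In this model the second crawler observes `pages == 1` even though it never touched the state, purely because both instances resolved to the same store and key.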

B4nan

I feel like I am getting lost in this. I thought the id is rather an internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want, at least not by default.
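The behaviour described in this comment (unique ids by default, sharing only on explicit opt-in) could look roughly like the following sketch. `IdCrawler`, its `state_key` property, and the key format are hypothetical names chosen for illustration; only the `crawler_id: int | None = None` argument comes from the diff in this PR.

```python
from __future__ import annotations

import itertools


class IdCrawler:
    """Hypothetical model: each instance gets a unique id unless one is passed."""

    # Class-level counter shared by all instances in the same process.
    _id_counter = itertools.count()

    def __init__(self, crawler_id: int | None = None) -> None:
        # An explicit id opts in to sharing; otherwise take the next unused number.
        self.id = next(self._id_counter) if crawler_id is None else crawler_id

    @property
    def state_key(self) -> str:
        # Assumed key format; the real key name may differ.
        return f"CRAWLER_STATE_{self.id}"


# Default construction: ids differ, so the state keys differ.
a, b = IdCrawler(), IdCrawler()

# Explicit opt-in: the same id yields the same state key.
shared_1, shared_2 = IdCrawler(crawler_id=100), IdCrawler(crawler_id=100)
```

One caveat of this simple version: an explicitly chosen id can coincide with an automatically assigned one, so a real implementation would need to decide whether that counts as intentional sharing.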

B4nan · 2026-01-12T16:05:28Z

src/crawlee/crawlers/_basic/_basic_crawler.py

        status_message_logging_interval: timedelta = timedelta(seconds=10),
        status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]
        | None = None,
+        crawler_id: int | None = None,


the option should be called just id as in the JS version, right?

Maybe Pepa chose that name because id is a builtin function in Python. If that's the case, I think we can safely shadow it.

@janbuchar exactly. But just id is fine for me as well.

The reason I dislike it is that "crawler.id" is very ambiguous, and it could be implemented to control:

only the state - people might get confused, why does id control only sharing of the state and not statistics? In that case it is not explicitly obvious from the argument and it would be better called something like state_id.

state and statistics - this would cause a situation where you can either share both or none and also Statistics can already be shared implicitly.

Pijukatel · 2026-01-13T09:22:05Z

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

This will be an easy and clear default without the need for extra arguments and maximum flexibility for custom use cases.
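A rough sketch of that idea, under stated assumptions: the default `use_state` stores state under an automatically incremented key in a default store, and a caller can replace it with any compatible callable. The names `SketchCrawler`, `make_default_use_state`, and the dict standing in for the default KVS are hypothetical and not part of Crawlee.

```python
from __future__ import annotations

import asyncio
import itertools
from typing import Any, Awaitable, Callable

State = dict[str, Any]
UseState = Callable[[State], Awaitable[State]]

# Toy stand-in for the default key-value store.
_DEFAULT_KVS: dict[str, State] = {}
_state_ids = itertools.count()


def make_default_use_state() -> UseState:
    """Build a use_state bound to a fresh, auto-incremented key."""
    key = f"CRAWLER_STATE_{next(_state_ids)}"

    async def use_state(default: State) -> State:
        return _DEFAULT_KVS.setdefault(key, dict(default))

    return use_state


class SketchCrawler:
    def __init__(self, use_state: UseState | None = None) -> None:
        # No argument: isolated state. Custom callable: whatever the caller wants.
        self.use_state = use_state or make_default_use_state()


async def demo() -> tuple[State, State]:
    # Two crawlers with defaults do not influence each other.
    a, b = SketchCrawler(), SketchCrawler()
    state = await a.use_state({"n": 0})
    state["n"] += 1
    isolated = await b.use_state({"n": 0})

    # Passing the same callable to both crawlers shares the state explicitly.
    shared = make_default_use_state()
    c, d = SketchCrawler(use_state=shared), SketchCrawler(use_state=shared)
    state = await c.use_state({"n": 0})
    state["n"] += 1
    seen_by_d = await d.use_state({"n": 0})
    return isolated, seen_by_d


isolated, seen_by_d = asyncio.run(demo())
```

Here sharing is expressed by object identity of the callable rather than by an id, which is closer to how an existing Statistics instance can be passed to more than one crawler.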

janbuchar · 2026-01-13T13:54:33Z

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

I have a hard time imagining that, could you sketch out some code samples?

Only for discussion, types ignored for now.

Pijukatel · 2026-01-13T16:00:04Z

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

I have a hard time imagining that, could you sketch out some code samples?

Please check the latest commit. I added an example of how this could be done. (Please do not focus on that specific example; it is just to demonstrate the idea. The question is whether use_state should be a hardcoded internal that can be parametrized, or a component of the crawler that can be fully replaced by a custom implementation.)

janbuchar · 2026-01-13T16:23:41Z

tests/unit/crawlers/_basic/test_basic_crawler.py

+    async def custom_use_state(default_state: dict[str, JsonSerializable]) -> dict[str, JsonSerializable]:
+        if not custom_state_dict:
+            custom_state_dict.update(default_state)
+        return custom_state_dict


The fact that this is not persistent by default will definitely surprise someone.

This is just a totally basic custom implementation of use_state for the sake of the test. It is by no means meant to be used anywhere outside of this test.

This is not the default implementation.

Sure, what I mean is that users could be surprised by the fact that if they supply a custom use_state, we don't handle persistence for them.
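To illustrate the persistence concern, here is a hedged sketch of what a user-supplied use_state would have to do on its own if the crawler does not persist custom state. Everything here is hypothetical: `make_file_backed_use_state`, the JSON file layout, and the explicit `persist` callback are assumptions for illustration, not an API proposed in this PR.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any


def make_file_backed_use_state(path: Path):
    """Return (use_state, persist); the caller is responsible for calling persist."""
    cache: dict[str, Any] = {}
    loaded = False

    async def use_state(default_state: dict[str, Any]) -> dict[str, Any]:
        nonlocal loaded
        if not loaded:
            # Restore a previous run if the file exists, else start from the default.
            if path.exists():
                cache.update(json.loads(path.read_text()))
            else:
                cache.update(default_state)
            loaded = True
        return cache

    def persist() -> None:
        # Nothing calls this automatically in this sketch.
        path.write_text(json.dumps(cache))

    return use_state, persist


async def demo() -> dict[str, Any]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "state.json"
        use_state, persist = make_file_backed_use_state(path)
        state = await use_state({"visited": 0})
        state["visited"] += 1
        persist()

        # Simulated restart: a new closure reads what the first one wrote.
        use_state_again, _ = make_file_backed_use_state(path)
        return await use_state_again({"visited": 0})


restored = asyncio.run(demo())
```

Without the explicit `persist()` call, the restarted run would begin again from `{"visited": 0}`, which is the kind of surprise the comment above describes.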

Expand existing test

88e0fb1

github-actions bot assigned Pijukatel Jan 12, 2026

github-actions bot added this to the 132nd sprint - Tooling team milestone Jan 12, 2026

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 12, 2026

Version 1: State depends on crawler_id, but stats does not.

31e16d2

Pijukatel force-pushed the add-crawler-id branch from 6ff0b3f to 31e16d2 Compare January 12, 2026 14:42

Pijukatel changed the title ~~Add crawler~~ fix: Do not share state between different crawlers unless requested Jan 12, 2026

B4nan reviewed Jan 12, 2026


Draft of use_state as input argument

1bbb651

Only for discussion, types ignored for now.

Pijukatel mentioned this pull request Jan 13, 2026

docs: State persistence update apify/apify-docs#2176

Open

janbuchar reviewed Jan 13, 2026

