-
Notifications
You must be signed in to change notification settings - Fork 539
fix: Do not share state between different crawlers unless requested #1669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
6ff0b3f to
31e16d2
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1669 +/- ##
==========================================
+ Coverage 92.41% 92.42% +0.01%
==========================================
Files 157 157
Lines 10478 10494 +16
==========================================
+ Hits 9683 9699 +16
Misses 795 795
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description) @janbuchar , @barjin, @B4nan. Could you please share your point of view? You can see the usage in code in the updated and new test in this PR. |
The Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs. Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit). |
B4nan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.
| status_message_logging_interval: timedelta = timedelta(seconds=10), | ||
| status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]] | ||
| | None = None, | ||
| crawler_id: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the option should be called just id as in the JS version, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Pepa chose that name because id is a bultin function in Python. If that's the case, I think we can safely shadow it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@janbuchar exactly. But just id is fine for me as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point why I dislike it is, that "crawler.id" is very ambiguous, and it can be implemented to control:
- only the state - people might get confused, why does
idcontrol only sharing of the state and not statistics? In that case it is not explicitly obvious from the argument and it would be better called something likestate_id. - state and statistics - this would cause a situation where you can either share both or none and also
Statisticscan already be shared implicitly.
What about having an optional argument This will be an easy and clear default without the need for extra arguments and maximum flexibility for custom use cases. |
I have a hard time imagining that, could you sketch out some code samples? |
Only for discussion, types ignored for now.
Please check the latest commit. I added an example of how this could be done. (Please do not focus on that specific example; it is just to demonstrate the idea. The question is whether the |
| async def custom_use_state(default_state: dict[str, JsonSerializable]) -> dict[str, JsonSerializable]: | ||
| if not custom_state_dict: | ||
| custom_state_dict.update(default_state) | ||
| return custom_state_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this is not persistent by default will definitely surprise someone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a totally basic custom implementation of use_state for the sake of the test. By no means to be used anywhere outside of this test.
This is not the default implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, what I mean is that users could be surprised by the fact that if they supply a custom use_state, we don't handle persistence for them.
Description
crawler_idforBasicCrawler. This argument controls the shared state.Draft for discussion. The main drawback is that this is another unique way of sharing something between different crawlers. Similar, but different existing approaches:
Statisticscan be shared by explicitly passing an existing instance to the crawler.configurationandstorage_clientarguments to the crawlerRequestManagercan be reused by explicitly passing an existing instance to therequest_managerargumentWhat are the alternatives?
state_kvsinstead ofcrawler_id, otherwise autoincrement state counter - this is more aligned with the existing approach of howStatisticscan be re-useddefault_kvs_idConfiguration value from SDK level to Crawlee level. This would allow to share or not share state based on what would be in theConfiguration(default_kvs_id=...)(if using the same storage client...)Issues
Closes: #1627
Testing
Checklist