feat: Add opt-in per-domain request throttling for HTTP 429 backoff #1762

MrAliHasan wants to merge 14 commits into apify:master from
Conversation
Add a new RequestThrottler component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:
- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The AutoscaledPool is completely untouched; throttling happens transparently in BasicCrawler.__run_task_function before processing.

Integration points:
- BasicCrawler: throttle check, 429 recording, success reset
- AbstractHttpCrawler: passes URL + Retry-After to detection
- PlaywrightCrawler: passes URL + Retry-After to detection

Closes apify#1437
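The exponential backoff schedule above (2s -> 4s -> 8s -> ... capped at 60s) boils down to a small pure function. A minimal sketch, assuming the crawler counts consecutive 429s per domain; compute_backoff is an illustrative name, not the PR's actual API:

```python
from datetime import timedelta

_BASE_DELAY = timedelta(seconds=2)   # delay after the first 429 from a domain
_MAX_DELAY = timedelta(seconds=60)   # cap on the per-domain delay

def compute_backoff(consecutive_429s: int) -> timedelta:
    """Return the delay after the n-th consecutive 429: 2s, 4s, 8s, ..., capped at 60s."""
    delay = _BASE_DELAY * 2 ** (consecutive_429s - 1)
    return min(delay, _MAX_DELAY)
```

A successful response to the domain would reset the counter, dropping the delay back to zero.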
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably also be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via RequestQueue.open(alias="..."), so you'd only need to implement a ThrottlingRequestManager (an implementation of the RequestManager interface) that would keep track of the per-domain queues and their delays.
Do you want to try it?
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from execution layer (BasicCrawler.__run_task_function) to scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue, implements RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled, eliminating busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
…queues and update its integration across crawlers.
Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.

Test structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.

BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!
Update: I just pushed a small follow-up commit fixing the MyPy typing and Ruff linting errors in the test suite that were causing the CI to fail. All local checks for ThrottlingRequestManager are now passing 100%. Ready for review whenever you have time!
Pijukatel
left a comment
Hello, thanks for the work on this PR! I have just some annoying edge cases to think about. I am not sure myself what the best way is to deal with them.
```python
async def _get_or_create_sub_queue(self, domain: str) -> RequestQueue:
    """Get or create a per-domain sub-queue."""
    if domain not in self._sub_queues:
        self._sub_queues[domain] = await RequestQueue.open(alias=f'throttled-{domain}')
```
We should think this through. Calling self._sub_queues[domain] = await RequestQueue.open(alias=f'throttled-{domain}') will use the global service_locator to get the configuration and storage client that will be used to create this RQ. On the other hand, inner was created from the crawler-specific service locator:

```python
inner = await RequestQueue.open(
    storage_client=self._service_locator.get_storage_client(),
    configuration=self._service_locator.get_configuration(),
)
```

So this could lead to unexpected behavior when these two service locators are not the same, and thus they could use a different configuration and a different storage_client.
What are other options?
Hard-code MemoryStorageClient
Hard-code MemoryStorageClient for all self._sub_queues[domain] and use the crawler-specific configuration. This would probably work for the majority of scenarios; it would be fast and cheap. But what about a heavily throttled crawler? Imagine a crawler that is crawling only one site, has a massive RQ, and that one site has a crawl_delay of 60s. In such a scenario, the massive RQ would just be loaded into one of the self._sub_queues until the memory limit of the crawler is used. Would it then stay in a deadlock due to memory load?
Use crawler-specific service_locator as an init argument to ThrottlingRequestManager and create all self._sub_queues like this:

```python
self._sub_queues[domain] = await RequestQueue.open(
    alias=f'throttled-{domain}',
    storage_client=self._service_locator.get_storage_client(),
    configuration=self._service_locator.get_configuration(),
)
```
This seems to me like a safe choice, but it would probably not be the cheapest and fastest. For example, on the Apify platform, it would use Apify-based RQ, which is more expensive and slower than an in-memory one.
Use in-memory as default, but allow custom service_locator?
I would probably prefer this approach. By default, no service locator would be passed to the ThrottlingRequestManager and it would use MemoryStorageClient, and for special use cases you could use:

```python
BasicCrawler(request_manager=ThrottlingRequestManager(service_locator=custom_service_locator))
```
Use in-memory as default, but limit the max size of the in-memory self._sub_queues?
```python
async def mark_request_as_handled(self, request: Request) -> ProcessedRequest | None:
    origin = self._dispatched_origins.get(request.unique_key)
    if origin and origin != 'inner' and origin in self._sub_queues:
        return await self._sub_queues[origin].mark_request_as_handled(request)
```
Ideally, I think we do not want to just mark the request in self._sub_queues as handled, but completely delete it from the subqueue. There is no point in tracking such a request in a subqueue, as it can consume memory in some implementations.
It is enough that it is marked as handled in the inner. That one already handles all the deduplication needed and that is the only place where we need to store handled requests.
But RQ does not define a delete method, so I'm not sure what to do with this...
```python
if not self._is_domain_throttled(domain):
    req = await sq.fetch_next_request()
    if req:
        self._mark_domain_dispatched(req.url)
        self._dispatched_origins[req.unique_key] = domain
        return req
```
Imagine a scenario where multiple domains are throttled at the same time, but the first of self._sub_queues has a massive amount of requests. The current approach would prefer that sub-queue just due to its order in self._sub_queues.
I think that the mechanism that decides which subqueue should be used to fetch_next_request should be based on the lowest time in _DomainState.throttled_until. So fetch from the longest overdue one.
Which brings me to another point. Do we even need to track _DomainState.last_request_at? I think we should just track _DomainState.throttled_until and update it on ThrottlingRequestManager._mark_domain_dispatched call.
Hey @Pijukatel, thanks for looking deeply into this! Great catches all around on the edge cases. I spent some time analyzing these points, and here is how I propose we handle them. Let me know if you are aligned on this direction before I push the changes:

1. Good catch. I'll simplify.

2. Sub-queue storage strategy: I propose your "Option 2" as the default, but with the flexibility of "Option 3".
3. Fetch priority redesign: I completely agree. Relying on iteration order rather than the longest-overdue domain is a flaw in the scheduling logic.

4. Deleting handled requests.

Does this direction look solid to you? If so, I'll get it coded and pushed up!
...
We discussed it internally, and there are some open points. Please let me think about it over the weekend, so that I don't point you in the wrong direction.
...
It is going in a good direction. Just a few more edge cases we discussed: having ad hoc request queues being created for domains can lead to two undesired scenarios:
To deal with this, we agreed it would be best to have the

Preserve existing behavior and introduce
Hey @Pijukatel, I've pushed the refactor based on your latest feedback. Here's what changed:

- Explicit domain routing
- Opt-in only
- Simplified state
- Documentation

All local checks pass (1647 tests, 0 failures). Ready for review!
Pijukatel
left a comment
Nice.
Now I have just a few more code-related comments.
…ctor domain state management and sub-queue handling.
Hey @Pijukatel, I've pushed all the changes from your latest review. Here's what was addressed:

- Import cleanup
- Dedicated purge method
- Docstrings
- Tests

All checks pass (1648 tests, 0 failures). Ready for review!
@MrAliHasan good work! Please just fix the errors reported by
The CI type check (Python 3.10) fails with
It seems the newer
One possible fix would be to narrow the type explicitly:

```python
result = await sq.add_request(request, forefront=forefront)
if result is None:
    raise RuntimeError("Unexpected None from add_request()")
return result
```

However, this case should never occur in practice, so I wanted to ask first: Is there a preferred approach for handling this kind of
I also noticed that
Hi @MrAliHasan, thank you for your inspiring work on this PR.

Make sure you are using

Regarding
….14 wheels for brotlicffi.
Thanks for the guidance! I've pinned

Regarding #1775, understood. I'll keep the current
This is wrong. You should have only run
…one returns and updates the ty dependency constraint.
Thanks for the clarification. I've reverted the hard pin in pyproject.toml and restored the original specifier, then updated the dependency using uv so the resolved version is recorded in uv.lock. I also addressed the remaining CI issues:
Please let me know if anything else should be adjusted.
Only one last detail from me: please update the PR description to match the latest state of the PR, especially that the
Thanks! I’ve updated the PR description to reflect the latest state and clarified that the
vdusek
left a comment
Thanks for the contribution @MrAliHasan, I have a few comments 🙂...
```python
        f'Sleeping {sleep_duration:.1f}s until earliest domain is available.'
    )
    await asyncio.sleep(sleep_duration)
    return await self.fetch_next_request()
```
Could we please use a while loop rather than recursion (because of smaller overhead)?
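A possible shape of that change, with the queue and polling reduced to stand-ins (the real method has more branches); the point is only that the sleep-and-retry becomes one loop iteration instead of a recursive call:

```python
import asyncio

async def fetch_next_request(queue: list[str], poll_interval: float = 0.01) -> str:
    # Loop instead of `await asyncio.sleep(...); return await self.fetch_next_request()`:
    # same behavior, but no growing chain of nested coroutine frames.
    while True:
        if queue:
            return queue.pop(0)
        await asyncio.sleep(poll_interval)

async def demo() -> str:
    queue: list[str] = []

    async def producer() -> None:
        await asyncio.sleep(0.03)
        queue.append('req-1')

    task = asyncio.create_task(producer())
    result = await fetch_next_request(queue)
    await task
    return result
```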
```python
_BASE_DELAY = timedelta(seconds=2)
"""Initial delay after the first 429 response from a domain."""

_MAX_DELAY = timedelta(seconds=60)
"""Maximum delay between requests to a rate-limited domain."""
```
Why not make these options configurable via __init__ with these defaults?
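One way to do that while keeping the current values as defaults; a sketch only, showing just the delay options rather than the real class, and the parameter names are assumptions:

```python
from datetime import timedelta

class ThrottlingRequestManager:
    # Sketch: only the suggested delay options are shown here,
    # not the PR's actual constructor or class body.
    def __init__(
        self,
        *,
        base_delay: timedelta = timedelta(seconds=2),
        max_delay: timedelta = timedelta(seconds=60),
    ) -> None:
        self._base_delay = base_delay
        self._max_delay = max_delay
```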
| """ | ||
| await self.drop() | ||
|
|
||
| inner = await RequestQueue.open( |
There was a problem hiding this comment.
It assumes the inner request manager is always a RequestQueue, right? But if the original inner was a RequestManagerTandem or another RequestManager subclass, this recreates it as a RequestQueue.
Good point. The current implementation assumes the inner manager is a RequestQueue, which covers the expected use case. I've documented this in the docstring. If a more generic approach is needed in the future, it can be extended to preserve the original manager type.
I think we should resolve this right ahead in this PR. Your opinion @janbuchar?
Agreed. IMO the most practical way to achieve this would be to require a request_manager_opener callback in the __init__ method. Most often, this would be just RequestQueue.open. You'd need to make the RequestManagerTandem class generic, but that makes a lot of sense anyway.
@MrAliHasan I'm afraid this is still not resolved correctly
```toml
[[package]]
name = "ty"
version = "0.0.17"
```
There is an even newer version of ty on master; I suggest undoing all updates to the lock file.
Thanks for the detailed review @vdusek! All points have been addressed:
I've restored

I did notice ty 0.0.21 flags some pre-existing errors in other files (e.g., test_basic_crawler.py, _redis/_utils.py), but none from the throttling manager code. If the type check still fails on CI, then please guide me on how to handle it.
Hi @MrAliHasan, there are still type check issues, could you resolve them?
Added an explicit None guard in
There are still type errors. You know you can run the checks locally, right? 🙂
Yes, I do run checks locally! 😄 I run

It seems the CI (Linux, ubuntu-latest) resolves the type narrowing differently for

Could you suggest the preferred approach here? I see a few options:
👍 |
Could you please undo all changes of the uv.lock file?
| """ | ||
| await self.drop() | ||
|
|
||
| inner = await RequestQueue.open( |
There was a problem hiding this comment.
I think we should resolve this right ahead in this PR. Your opinion @janbuchar?
```python
if isinstance(self._request_manager, ThrottlingRequestManager):
    crawl_delay = robots_txt_file.get_crawl_delay()
    if crawl_delay is not None:
        self._request_manager.set_crawl_delay(url, crawl_delay)
```
IIUC, this is called for every request, but it's redundant after the first call for a given domain. Could you improve that? (caching or checking if it was already set)
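A simple way to remove the per-request redundancy is to remember which domains already had their crawl-delay applied. A sketch of that caching; the class and method names are illustrative, not the PR's code:

```python
from urllib.parse import urlparse

class CrawlDelaySetter:
    """Apply the crawl-delay at most once per domain."""

    def __init__(self) -> None:
        self._configured_domains: set[str] = set()
        self.calls: list[tuple[str, float]] = []  # stand-in for the real setter

    def set_crawl_delay_once(self, url: str, crawl_delay: float) -> bool:
        domain = urlparse(url).netloc
        if domain in self._configured_domains:
            return False  # already configured; skip the redundant call
        self._configured_domains.add(domain)
        self.calls.append((domain, crawl_delay))
        return True
```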
@MrAliHasan there are still some unresolved comments, mainly #1762 (comment) - can you please take care of those?
Yes, I'll work on it tomorrow.
…ard, crawl-delay caching, revert uv.lock
Thanks for the follow-up @vdusek @janbuchar! All comments have been addressed:
docs/guides/request_throttling.mdx
Outdated
```mdx
## Overview

The <ApiLink to="class/ThrottlingRequestManager">`ThrottlingRequestManager`</ApiLink> wraps a <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and manages per-domain throttling. You specify which domains to throttle at initialization, and the manager automatically:
```
I believe it can potentially wrap any RequestManager, no?
You're right, it can wrap any RequestManager. I'll update the docs to reflect that.
```python
if purge_request_queue and isinstance(request_manager, RequestQueue):
    await request_manager.drop()
    self._request_manager = await RequestQueue.open(
        storage_client=self._service_locator.get_storage_client(),
        configuration=self._service_locator.get_configuration(),
    )
```
Even in the state before the change, this was a code smell - shouldn't we add a "purge_on_start_hook"-like abstract method to RequestManager and implement it in RequestQueue? Or should we just call .drop on request manager?
This is aimed mostly at @vdusek and @Pijukatel. We definitely don't need to resolve it in this PR if you guys don't see an obvious way out.
Understood, happy to leave this for a follow-up.
```python
# Record successful request to reset rate limit backoff for this domain.
if isinstance(request_manager, ThrottlingRequestManager):
    request_manager.record_success(request.url)
```
Can't the "record success" part be implemented in ThrottlingRequestManager.mark_request_as_handled instead? It could probably look at request.state, couldn't it?
Done, moved record_success into ThrottlingRequestManager.mark_request_as_handled and removed the isinstance check from the crawler.
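The change described can be sketched like this: the backoff-reset bookkeeping lives inside mark_request_as_handled, so the crawler needs no isinstance check. The state shape and method bodies are assumptions, not the PR's code:

```python
from urllib.parse import urlparse

class ThrottleBookkeeping:
    """Reset a domain's backoff when one of its requests is marked handled."""

    def __init__(self) -> None:
        self.throttled_until: dict[str, float] = {}

    def record_domain_delay(self, url: str, until: float) -> None:
        self.throttled_until[urlparse(url).netloc] = until

    def mark_request_as_handled(self, url: str) -> None:
        # A handled request means the domain answered successfully,
        # so any recorded backoff for it can be dropped right here.
        self.throttled_until.pop(urlparse(url).netloc, None)
```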
| """ | ||
| await self.drop() | ||
|
|
||
| inner = await RequestQueue.open( |
There was a problem hiding this comment.
@MrAliHasan I'm afraid this is still not resolved correctly
…to mark_request_as_handled, fix docs
Added a
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), requests get retried without any per-domain delay — potentially making rate limiting worse.
Solution
Introduces the ThrottlingRequestManager, an opt-in request manager wrapper that enforces per-domain delays at the scheduling layer.

Key features:
- record_domain_delay() sets a per-domain throttled_until timestamp based on Retry-After headers
- BasicCrawler automatically calls set_crawl_delay() when respect_robots_txt_file is enabled and the request manager is a ThrottlingRequestManager
- A warning is logged when respect_robots_txt_file is enabled but the request manager is not a ThrottlingRequestManager
- fetch_next_request() skips throttled domains, falls back to the inner queue, and sleeps only when all sub-queues are throttled and the inner queue is empty
- recreate_purged() handles queue reconstruction across crawler restarts

How it works
- When a domain is throttled (its throttled_until is in the future), fetch_next_request() skips it and falls back to the inner queue
- record_domain_delay() updates per-domain backoff on HTTP 429 responses, respecting Retry-After headers
- set_crawl_delay() integrates robots.txt crawl-delay when enabled

Usage
Files changed
- _throttling_request_manager.py
- http.py: parse_retry_after_header utility
- _basic_crawler.py: recreate_purged() integration, crawl-delay warning
- _playwright_crawler.py: Retry-After header
- test_throttling_request_manager.py: RequestQueue with MemoryStorageClient
Tests
- Throttling, scheduling, recreate_purged(), and edge cases

Future work
This is a focused first step toward a more complete RequestAnalyzer that may include: