Plugin-managed sessions

Note

This page covers plugin-managed sessions, a session management feature built into scrapy-zyte-api. It does not cover the two session management features provided natively by Zyte API:

User-managed sessions, which give you full control over session management via the session field.
Zyte-managed sessions, which let Zyte API handle session management for you via the sessionContext field.

You can use both of those Zyte API features directly from scrapy-zyte-api through their corresponding request parameters.

Plugin-managed sessions have an API similar to that of Zyte-managed sessions, but are built on top of user-managed sessions.

Plugin-managed sessions offer some advantages over Zyte-managed sessions:

You can perform session validity checks, so that the sessions of responses that do not pass those checks are refreshed, and the responses retried with a different session.
You can use arbitrary Zyte API parameters for session initialization, beyond those that sessionContextParameters supports.
You have granular control over the session pool size, max errors, etc. See Optimizing sessions and Overriding session configs.

However, plugin-managed sessions are not a replacement for Zyte-managed sessions or user-managed sessions:

Zyte-managed sessions offer a longer lifetime than the user-managed sessions that plugin-managed sessions use, so as long as you do not need one of the features of plugin-managed sessions, Zyte-managed sessions can be significantly more efficient (fewer session-initialization requests needed per crawl).

Zyte API can also optimize Zyte-managed sessions based on the target website. With plugin-managed sessions, you need to handle optimization yourself.
User-managed sessions offer full control over session management, while plugin-managed sessions remove some of that control to provide an easier API for supported use cases.

Enabling session management

To enable session management for all requests, set ZYTE_API_SESSION_ENABLED to True. You can also toggle session management on or off for specific requests using the zyte_api_session_enabled request metadata key, or override the enabled() method of a session config override.
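As a minimal sketch of the two switches described above: the setting goes into your project settings, and the metadata key goes into the meta dictionary of an individual request. The variable name no_session_meta is hypothetical and only illustrates the shape of the dictionary.

```python
# settings.py: enable plugin-managed sessions for every request.
ZYTE_API_SESSION_ENABLED = True

# Per-request opt-out (hypothetical variable name): pass this dictionary as
# the ``meta`` argument of a request to skip session management for that
# request only.
no_session_meta = {"zyte_api_session_enabled": False}
```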

By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a browser request targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and session config overrides.

For session management to work as expected, session requests must use a retry policy that does not retry 520 and 521 responses, so that the session management middleware can handle those instead.

520 and 521 are Zyte API status codes for download errors (e.g. connection refused). When session management receives a 520 or 521 response, it counts it as a session error, potentially discards the session (see ZYTE_API_SESSION_MAX_ERRORS), and retries the request with a different session. If the retry policy also retried 520 and 521 responses, it would do so before the session middleware can swap the session, potentially reusing the same problematic session for the retry.

scrapy-zyte-api handles this automatically: all requests that are assigned a session get their zyte_api_retry_policy request metadata key set (via setdefault()) to the value of ZYTE_API_SESSION_RETRY_POLICY.

Non-session requests continue to use ZYTE_API_RETRY_POLICY as usual, unaffected by session management.

To override the retry policy for a specific request only, set zyte_api_retry_policy in the request metadata before the request reaches the session middleware. The setdefault() call will not override an already-set value.
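The effect of setdefault() can be illustrated with plain dictionaries. This is not the plugin's code, only a sketch of the behavior described above; the function name and the string stand-ins for retry policies are hypothetical.

```python
# Stand-in for the value of ZYTE_API_SESSION_RETRY_POLICY (hypothetical).
SESSION_RETRY_POLICY = "session-retry-policy"


def assign_session_retry_policy(meta: dict) -> dict:
    """Mimic the setdefault() step: only fill the key when it is missing."""
    meta.setdefault("zyte_api_retry_policy", SESSION_RETRY_POLICY)
    return meta


# No policy set by the user: the session retry policy is applied.
plain = assign_session_retry_policy({})

# Policy already set by the user: it is left untouched.
custom = assign_session_retry_policy({"zyte_api_retry_policy": "my-policy"})
```

In other words, a request-level value only survives if it is already in the request metadata by the time the request is assigned a session.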

Initializing sessions

To change the default session initialization parameters, you have the following options:

To initialize sessions with a given location, use the ZYTE_API_SESSION_LOCATION setting or the zyte_api_session_location request metadata key.

The value should be a dictionary with keys supported by the address field of the setLocation action, e.g.
```
{
    "addressCountry": "US",
    "addressRegion": "NY",
    "postalCode": "10001",
    "streetAddress": "3 Penn Plz",
}
```
By default, the location is set using the setLocation action. A session config override can change that through params().
For session initialization with arbitrary Zyte API request fields, use the ZYTE_API_SESSION_PARAMS setting or the zyte_api_session_params request metadata key.

It works similarly to sessionContextParameters from Zyte-managed sessions, but it supports arbitrary Zyte API parameters instead of a specific subset.

If it does not define a "url", the URL of the request triggering a session initialization request will be used.
When defining a session config override, you can customize the default and location-setting session initialization parameters through params().

location() can define a default location for its session config override to use when no location is specified otherwise.
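The two request-level options above could look as follows. This is a sketch: the variable names are hypothetical, and the URL and CSS selector are made-up values. Each dictionary is meant to be passed as the meta argument of a request.

```python
# Option 1: initialize the session for a given location.
location_meta = {
    "zyte_api_session_location": {
        "addressCountry": "US",
        "postalCode": "10001",
    },
}

# Option 2: initialize the session with arbitrary Zyte API request fields.
# The URL and the CSS selector below are made-up examples.
params_meta = {
    "zyte_api_session_params": {
        "url": "https://example.com/",
        "browserHtml": True,
        "actions": [
            {
                "action": "click",
                "selector": {"type": "css", "value": "#accept-cookies"},
            },
        ],
    },
}
```

If "url" were omitted from the second dictionary, the URL of the request triggering the session initialization would be used instead, as noted above.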

Precedence, from higher to lower, is:

Note

The IP address assigned to a session is determined during session initialization and remains fixed for the lifetime of the session. Using a different geolocation in a follow-up request that reuses a session is not supported and results in undefined behavior.

Checking sessions

Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.

Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches, A/B tests, etc. It can also help in cases where website sessions expire before Zyte API sessions.

By default, if the session initialization parameters include actions, and any of them has a returned status in the response (meaning it failed and stopped execution), the session is discarded. Actions with onError set to "continue" that fail produce a continued status instead, and do not cause the session to be discarded. You can disable this behavior by setting ZYTE_API_SESSION_INIT_ACTION_FAILURE_INVALIDATES_SESSION to False.
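The rule above can be sketched over a raw Zyte API response dictionary. This is an illustration and not the plugin's implementation: the function name is hypothetical, and the "success" status and the overall shape of the actions list are assumptions about the response format.

```python
def init_actions_failed(api_response: dict) -> bool:
    """Return True if any action stopped execution ("returned" status)."""
    actions = api_response.get("actions", [])
    return any(action.get("status") == "returned" for action in actions)


# A failed action with onError set to "continue" reports "continued".
tolerated = {"actions": [{"action": "click", "status": "continued"}]}
# A failed action that stopped execution reports "returned".
failed = {
    "actions": [
        {"action": "click", "status": "success"},
        {"action": "setLocation", "status": "returned"},
    ]
}
# A session initialized without actions has nothing to check.
no_actions = {"browserHtml": "<html></html>"}
```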

In addition, if a location is defined through zyte_api_session_location, ZYTE_API_SESSION_LOCATION or location(), and the setLocation action is not available for a given website, the spider is closed with unsupported_set_location as the close reason; in that case, you should define a proper session initialization logic for requests targeting that website.

For sessions initialized without actions, no action-based session check is performed.

To implement your own code to check session responses and determine whether their session should be kept or discarded, use the ZYTE_API_SESSION_CHECKER setting. If you need to check session validity for multiple websites, it is better to define a separate session config override for each website, each with its own implementation of check().
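A minimal sketch of a session checker follows. It assumes that the setting accepts a class, or its import path, that exposes a check(response, request) method with the same contract as SessionConfig.check. The class name, the import path and the "Sign in" marker are hypothetical.

```python
class LoggedInSessionChecker:
    """Keep a session only while responses look like a logged-in page."""

    # Hypothetical marker that the target website shows to anonymous users.
    LOGGED_OUT_MARKER = "Sign in"

    def check(self, response, request) -> bool:
        # True keeps the session; False discards it, so that the request is
        # retried with a different session.
        return self.LOGGED_OUT_MARKER not in response.text


# settings.py (hypothetical import path):
ZYTE_API_SESSION_CHECKER = "myproject.sessions.LoggedInSessionChecker"
```

Because this check reads response.text, it only works for responses that carry a body, which ties in with the advice about browserHtml and httpResponseBody in this section.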

The zyte_api_session_location and zyte_api_session_params request metadata keys, if present in a request that triggers a session initialization request, will be copied into the session initialization request, so that they are available when ZYTE_API_SESSION_CHECKER or check() are called for a session initialization request.

If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using Zyte API automatic extraction, e.g. when using Zyte spider templates. For example, you can use ZYTE_API_AUTOMAP_PARAMS and ZYTE_API_PROVIDER_PARAMS to force browserHtml or httpResponseBody to be set on every Zyte API request:

settings.py

ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}

Managing pools

scrapy-zyte-api can maintain multiple session pools.

By default, scrapy-zyte-api maintains a separate pool of sessions per domain.

If you use the zyte_api_session_params or zyte_api_session_location request metadata keys, scrapy-zyte-api will automatically use separate session pools within the target domain for those parameters or locations. See pool() for details.

If you want to customize further which pool is assigned to a given request, e.g. to have the same pool for multiple domains or use different pools within the same domain (e.g. for different URL patterns), you can either use the zyte_api_session_pool request metadata key or use the pool() method of session config overrides.
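For instance, a request-level pool override based on URL patterns might be sketched like this. The function name and the rule it applies (a dedicated pool for search pages) are hypothetical; only the zyte_api_session_pool key comes from the text above.

```python
from urllib.parse import urlparse


def session_pool_for(url: str) -> str:
    """Return a pool ID: search pages get their own pool (hypothetical rule)."""
    parsed = urlparse(url)
    if parsed.path.startswith("/search"):
        return f"{parsed.netloc}/search"
    return parsed.netloc


url = "https://example.com/search?q=book"
# Pass this dictionary as the ``meta`` argument of the corresponding request.
meta = {"zyte_api_session_pool": session_pool_for(url)}
```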

The ZYTE_API_SESSION_POOL_SIZE setting determines the desired number of concurrent, active, working sessions per pool. The ZYTE_API_SESSION_POOLS setting allows defining different values for specific pools.
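As a sketch, the two settings could be combined as follows. The shape of ZYTE_API_SESSION_POOLS is an assumption here (a mapping from pool ID to per-pool options, with the "size" and "delay" keys that pool() can also return), and the pool IDs and values are illustrative.

```python
# settings.py
# Desired number of concurrent, active, working sessions for every pool.
ZYTE_API_SESSION_POOL_SIZE = 8

# Assumed shape: pool ID -> per-pool options. Pool IDs and values are
# illustrative, not recommendations.
ZYTE_API_SESSION_POOLS = {
    "example.com": {"size": 16},
    "slow.example": {"size": 4, "delay": 2.0},
}
```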

The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:

When a request asks for a session from a given pool, if the session pool has not yet reached its desired pool size, a session initialization request is triggered. If the session pool has been filled, an existing session is used instead.
When a response associated with a session pool indicates that its session has expired, has reached the error limit (see ZYTE_API_SESSION_MAX_ERRORS), or has reached the validity check failure limit (see ZYTE_API_SESSION_MAX_CHECK_FAILURES), a session initialization request is triggered to replace that session in the session pool.
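The two rules above can be modeled with a toy pool. This is a simplified illustration and not the plugin's implementation: the class and its methods are hypothetical, and round-robin rotation is an assumption.

```python
from collections import deque
from itertools import count


class ToyPool:
    """Toy model of a session pool with a desired size."""

    def __init__(self, size: int):
        self.size = size
        self.sessions = deque()
        self.inits = 0  # number of session initializations triggered
        self._ids = count(1)

    def _init_session(self) -> str:
        self.inits += 1
        session_id = f"s{next(self._ids)}"
        self.sessions.append(session_id)
        return session_id

    def get(self) -> str:
        # Pool not yet full: trigger a session initialization.
        if len(self.sessions) < self.size:
            return self._init_session()
        # Pool full: reuse an existing session, rotating among them.
        self.sessions.rotate(-1)
        return self.sessions[-1]

    def discard(self, session_id: str) -> str:
        # Expired or over-limit session: replace it with a new one.
        self.sessions.remove(session_id)
        return self._init_session()


pool = ToyPool(size=2)
first, second, third = pool.get(), pool.get(), pool.get()
replacement = pool.discard(first)
```

In this model, the third request reuses the first session because the pool is already full, and discarding a session is the only event that triggers a further initialization.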

The session pool assigned to a request affects the fingerprint of the request. Two requests with different session pool IDs are considered different requests, i.e. not duplicate requests, even if they are otherwise identical.

Optimizing sessions

For faster crawls and lower costs, especially where session initialization requests are more expensive than session usage requests (e.g. scenarios where initialization relies on browserHtml while usage relies on httpResponseBody), you should try to make your sessions live as long as possible before they are discarded.

Here are some things you can try:

On some websites, sending too many requests too fast through a session can cause the target website to ban that session.

On those websites, you can increase ZYTE_API_SESSION_DELAY, ZYTE_API_SESSION_POOL_SIZE, or both, to lower the rate of session reuse.

Mind, however, that user-managed sessions expire 15 minutes after creation or 2 minutes after their last request (see session). At a certain point, increasing ZYTE_API_SESSION_POOL_SIZE without increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN accordingly can be counterproductive.
By default, sessions are discarded as soon as an unsuccessful response is received or a validity check is failed.

However, on some websites sessions may remain valid even after a few unsuccessful responses or validity check failures. If that is the case, you might want to increase the corresponding setting, ZYTE_API_SESSION_MAX_ERRORS or ZYTE_API_SESSION_MAX_CHECK_FAILURES, to require a higher number of the corresponding outcome before discarding a session.
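A settings sketch that combines the suggestions above could look like this. All values are illustrative rather than recommendations, and the unit of ZYTE_API_SESSION_DELAY is assumed to be seconds.

```python
# settings.py (illustrative values, not recommendations)

# Lower the rate of reuse of each session.
ZYTE_API_SESSION_DELAY = 2.0  # assumed to be expressed in seconds
ZYTE_API_SESSION_POOL_SIZE = 16

# Keep concurrency in line with the pool size, so that sessions do not sit
# idle long enough to expire.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Tolerate a few bad outcomes before discarding a session.
ZYTE_API_SESSION_MAX_ERRORS = 3
ZYTE_API_SESSION_MAX_CHECK_FAILURES = 2
```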

If you do not need session checking and your initialization parameters are only browserHtml and actions, Zyte-managed sessions might be a more cost-effective choice, as they live much longer than user-managed sessions.

Overriding session configs

For spiders that target a single website, settings and request metadata keys are usually enough for session initialization and session checking. However, you might want to define different session configs for different websites when writing broad-crawl or multi-website spiders, when you need to modify session-using requests based on session initialization responses, or for code reusability.

The default session config is implemented by the SessionConfig class:

class scrapy_zyte_api.SessionConfig(crawler)

Default session configuration for scrapy-zyte-api sessions.

ADDRESS_FIELDS: list[str] = ['addressCountry', 'addressRegion', 'postalCode', 'streetAddress']: List of address fields to use when available, and their order, when creating a pool ID for a request based on the content of the zyte_api_session_location metadata key. See pool().

check(response: Response, request: Request) → bool

Return True if the session used to fetch response should be kept, return False if it should be discarded, or raise CloseSpider if the spider should be closed.

The default implementation checks the outcome of the setLocation action if a location was defined, and also discards sessions where any action in the initialization response has a returned status (i.e. failed and stopped execution). Both behaviors are described in Checking sessions.

If you need to tell whether request is a session initialization request or not, use is_session_init_request().

See also

LocationSessionConfig

pool(request: Request) → str | PoolConfig

Return the ID of the session pool to use for request, or a dict with additional session pool config.

The main aspects of the default implementation are described in Managing pools.

When the zyte_api_session_params request metadata key is used, the pool ID is the target domain followed by an integer between brackets (e.g. example.com[0]), and a log message indicates which session initialization parameters are associated with that pool ID.

When the zyte_api_session_location request metadata key is used, the pool ID is the target domain followed by an at sign and the comma-separated values of the non-empty fields from ADDRESS_FIELDS (e.g. example.com@US,NY,10001).
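The location-based pool ID format described above can be reproduced with a few lines of standalone Python. This is only a sketch of the documented format, not the library's code; the function name is hypothetical, and the real implementation may derive the domain differently.

```python
from urllib.parse import urlparse

ADDRESS_FIELDS = ["addressCountry", "addressRegion", "postalCode", "streetAddress"]


def location_pool_id(url: str, location: dict) -> str:
    """Build '<domain>@<field values>' from the non-empty address fields."""
    domain = urlparse(url).netloc
    # Keep the order of ADDRESS_FIELDS and skip missing or empty fields.
    values = [location[field] for field in ADDRESS_FIELDS if location.get(field)]
    return f"{domain}@{','.join(values)}"


pool_id = location_pool_id(
    "https://example.com/products",
    {"addressCountry": "US", "addressRegion": "NY", "postalCode": "10001"},
)
print(pool_id)  # example.com@US,NY,10001
```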

Instead of a string, this method can also return a dict containing the pool ID under the id key, and optionally any other key supported by ZYTE_API_SESSION_POOLS. For example:

def pool(self, request):
    if "ecommerce.example" in urlparse_cached(request).netloc:
        return {
            "id": "ecommerce.example",
            "delay": 2.0,
            "size": 16,
        }
    return super().pool(request)

The values of optional keys take precedence over the corresponding pool-independent settings, e.g. delay takes precedence over ZYTE_API_SESSION_DELAY for the corresponding pool ID, but do not override those defined in ZYTE_API_SESSION_POOLS.

For any given pool ID, the values of optional keys are only taken into account when the pool ID is first encountered. You cannot use this method to change them at run time.

process_request(request: Request) → Request | None

Process request after it has been assigned a session.

Return None to send the request as is, or return a new request object to replace the original request.

The default implementation does not modify the request.

You can combine this method and check() to modify requests based on session initialization responses. For example:

In __init__(), create a dictionary to store session data:

def __init__(self, crawler):
    super().__init__(crawler)
    self.session_data = {}

In check(), store data from the session initialization response in session_data:

def check(self, response: Response, request: Request) -> bool:
    if scrapy_zyte_api.is_session_init_request(request):
        session_id = scrapy_zyte_api.get_request_session_id(request)
        self.session_data[session_id] = {
            "csrf_token": response.css(".csrf-token::text").get(),
        }
    return super().check(response, request)

In process_request(), read the session data and act accordingly, either modifying the request in place where possible, e.g.:

def process_request(self, request: Request) -> Optional[Request]:
    session_id = scrapy_zyte_api.get_request_session_id(request)
    csrf_token = self.session_data[session_id]["csrf_token"]
    request.headers["CSRF-Token"] = csrf_token

Or returning an entirely new request, e.g.:

def process_request(self, request: Request) -> Optional[Request]:
    session_id = scrapy_zyte_api.get_request_session_id(request)
    csrf_token = self.session_data[session_id]["csrf_token"]
    new_url = w3lib.url.add_or_replace_parameter(request.url, "csrf_token", csrf_token)
    return request.replace(url=new_url)

To define a different session config for a given URL pattern, install web-poet and define a subclass of SessionConfig decorated with session_config():

scrapy_zyte_api.session_config(include: str | Iterable[str], *, instead_of: type[SessionConfig] | None = SessionConfig, exclude: str | Iterable[str] | None = None, priority: int = 500, **kwargs)

If you only need to override the SessionConfig.check or SessionConfig.params methods for scenarios involving a location, you may subclass LocationSessionConfig instead:

class scrapy_zyte_api.LocationSessionConfig(crawler)

SessionConfig subclass to minimize boilerplate when implementing location-specific session configs, i.e. session configs where the default values should be used unless a location is set.

Provides counterparts to some SessionConfig methods that are only called when a location is set, and get that location as a parameter.

location_check(response: Response, request: Request, location: dict[str, Any]) → bool: Like SessionConfig.check, but it is only called when a location is set, and gets that location as a parameter.

location_params(request: Request, location: dict[str, Any]) → dict[str, Any] | Awaitable[dict[str, Any]]: Like SessionConfig.params, but it is only called when a location is set, and gets that location as a parameter.

If in a session config implementation or in any other Scrapy component you need to tell whether a request is a session initialization request or not, use is_session_init_request():

scrapy_zyte_api.is_session_init_request(request): Return True if the request is a session initialization request or False otherwise.

To get the session ID of a given request, use:

scrapy_zyte_api.get_request_session_id(request: Request) → str | None: Return the session ID of request, or None if it does not have a session ID assigned.

Classes decorated with session_config() are registered into session_config_registry:

scrapy_zyte_api.session_config_registry: Instance of web_poet.rules.RulesRegistry that holds session configs.

Session retry policies

The following retry policies are designed to work well with session management (see Enabling session management). They are meant for ZYTE_API_SESSION_RETRY_POLICY:

scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY: Alternative to the default retry policy for session management that does not retry 520 and 521 responses.

scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY: Alternative to the aggressive retry policy for session management that does not retry 520 and 521 responses.

Note

When using python-zyte-api 0.5.2 or lower, this is the same as SESSION_DEFAULT_RETRY_POLICY.

Spider closers

Session management can close your spider early in the following scenarios:

bad_session_inits: Too many session initializations failed in a row for a given session pool.

You can use the ZYTE_API_SESSION_MAX_BAD_INITS and ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL settings to adjust that maximum.
pool_error: There was an error determining the session pool ID for some request.

It is most likely the result of a bad implementation of SessionConfig.pool; the logs should contain an error message with a traceback for such errors.
unsupported_set_location: You used ZYTE_API_SESSION_LOCATION or zyte_api_session_location to configure session initialization with the setLocation action, but Zyte API does not yet support setLocation for the target website.
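For the bad_session_inits case above, the two settings might be combined as in the following sketch. The values are illustrative, and the shape of the per-pool setting (a mapping from pool ID to limit) is an assumption.

```python
# settings.py (illustrative values)

# Maximum number of session initializations allowed to fail in a row for a
# session pool before the spider is closed.
ZYTE_API_SESSION_MAX_BAD_INITS = 8

# Assumed shape: pool ID -> limit for that specific pool.
ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL = {
    "flaky.example": 16,
}
```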

A custom SessionConfig.check implementation may also close your spider with a custom reason by raising a CloseSpider exception.

Troubleshooting

RuntimeError: Could not get a session ID

This exception means that scrapy-zyte-api could not get a session ID from the session rotation queue after the configured number of attempts, waiting the configured minimum time between attempts. If you see it, consider the following possibilities:

A bug in your session validation code may be causing it to return False for a valid response.

This is especially likely if you see this issue for very few, specific requests, while most requests work fine.
The values of the ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS and ZYTE_API_SESSION_QUEUE_WAIT_TIME settings may be too low for your scenario, in which case you can modify them accordingly.
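If the queue settings turn out to be the cause, raising them could look like the following sketch. The values are illustrative, and the wait time is assumed to be expressed in seconds.

```python
# settings.py (illustrative values)

# Number of attempts to get a session ID from the session rotation queue.
ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS = 120

# Minimum wait between attempts (assumed to be expressed in seconds).
ZYTE_API_SESSION_QUEUE_WAIT_TIME = 2.0
```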