Plugin-managed sessions

Note

This page covers plugin-managed sessions, a session management feature built into scrapy-zyte-api. It does not cover the 2 session management features provided natively by Zyte API: Zyte-managed sessions and user-managed sessions.

You can use both of those Zyte API features directly from scrapy-zyte-api through their corresponding request parameters.

Plugin-managed sessions have an API similar to that of Zyte-managed sessions, but are built on top of user-managed sessions.

Plugin-managed sessions offer some advantages over Zyte-managed sessions:

However, plugin-managed sessions are not a replacement for Zyte-managed sessions or user-managed sessions:

  • Zyte-managed sessions offer a longer lifetime than the user-managed sessions that plugin-managed sessions use. As long as you do not need one of the features of plugin-managed sessions, Zyte-managed sessions can be significantly more efficient (fewer session-initialization requests needed per crawl).

    Zyte API can also optimize Zyte-managed sessions based on the target website. With plugin-managed sessions, you need to handle optimization yourself.

  • User-managed sessions offer full control over session management, while plugin-managed sessions remove some of that control to provide an easier API for supported use cases.

Enabling session management

To enable session management for all requests, set ZYTE_API_SESSION_ENABLED to True. You can also toggle session management on or off for specific requests using the zyte_api_session_enabled request metadata key, or override the enabled() method of a session config override.
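As a minimal sketch of the first two toggles, assuming a project that enables sessions globally and opts individual requests out: the setting name and the metadata key come from the paragraph above, while the helper name without_session is hypothetical. The returned dict is meant to be passed as meta= when building a Scrapy request.

```python
from typing import Optional

# settings.py: enable plugin-managed sessions for every request.
ZYTE_API_SESSION_ENABLED = True


def without_session(meta: Optional[dict] = None) -> dict:
    """Return a copy of ``meta`` that opts a request out of session management.

    Hypothetical helper, not part of scrapy-zyte-api.
    """
    result = dict(meta or {})
    result["zyte_api_session_enabled"] = False
    return result
```

Copying the dict instead of mutating it keeps the helper safe to use with metadata shared between requests.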

By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a browser request targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and session config overrides.

For session management to work as expected, session requests must use a retry policy that does not retry 520 and 521 responses, so that the session management middleware can handle those instead.

520 and 521 are Zyte API status codes for download errors (e.g. connection refused). When session management receives a 520 or 521 response, it counts it as a session error, potentially discards the session (see ZYTE_API_SESSION_MAX_ERRORS), and retries the request with a different session. If the retry policy also retried 520 and 521 responses, it would do so before the session middleware can swap the session, potentially reusing the same problematic session for the retry.

scrapy-zyte-api handles this automatically: all requests that are assigned a session get their zyte_api_retry_policy request metadata key set (via setdefault()) to the value of ZYTE_API_SESSION_RETRY_POLICY.

Non-session requests continue to use ZYTE_API_RETRY_POLICY as usual, unaffected by session management.

To override the retry policy for a specific request only, set zyte_api_retry_policy in the request metadata before the request reaches the session middleware. The setdefault() call will not override an already-set value.
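The effect of setdefault() can be illustrated with plain dicts. This is a model of the behavior described above, not the plugin's own code; the function name apply_session_retry_policy is hypothetical, and it assumes that retry policies may be given as import path strings.

```python
# settings.py: policy applied by default to every session request.
# Assumption: an import path string is accepted as the value.
ZYTE_API_SESSION_RETRY_POLICY = "scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY"


def apply_session_retry_policy(meta: dict) -> dict:
    """Model of the setdefault() step: only fill the key if it is missing."""
    meta.setdefault("zyte_api_retry_policy", ZYTE_API_SESSION_RETRY_POLICY)
    return meta


# A request that already sets its own policy keeps it...
custom = apply_session_retry_policy(
    {"zyte_api_retry_policy": "scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY"}
)
# ...while a request without one receives the session default.
default = apply_session_retry_policy({})
```

If you override the policy for a request, it is safer to pick one that also leaves 520 and 521 responses to the session middleware, as the per-request value in this sketch does.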

Initializing sessions

To change the default session initialization parameters, you have the following options:

Precedence, from higher to lower, is:

  1. zyte_api_session_params

  2. zyte_api_session_location

  3. ZYTE_API_SESSION_PARAMS

  4. ZYTE_API_SESSION_LOCATION

  5. location()

  6. params()
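To make the first four entries of that list concrete, here is an illustrative model that reports which source wins when several are set. It is only a sketch: winning_source is a hypothetical name, plain dicts stand in for request metadata and Scrapy settings, and the location() and params() methods are left out.

```python
from typing import Optional

# Precedence order copied from the first four entries of the list above.
# Each entry is (where the value lives, key name).
PRECEDENCE = [
    ("meta", "zyte_api_session_params"),
    ("meta", "zyte_api_session_location"),
    ("settings", "ZYTE_API_SESSION_PARAMS"),
    ("settings", "ZYTE_API_SESSION_LOCATION"),
]


def winning_source(meta: dict, settings: dict) -> Optional[str]:
    """Return the highest-precedence key that has a non-empty value, if any."""
    sources = {"meta": meta, "settings": settings}
    for origin, key in PRECEDENCE:
        if sources[origin].get(key):
            return key
    return None


settings = {"ZYTE_API_SESSION_LOCATION": {"postalCode": "10001"}}
meta = {"zyte_api_session_params": {"browserHtml": True}}
winner = winning_source(meta, settings)
```

Here the request metadata wins over the project-wide location setting, so winner is "zyte_api_session_params".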

Note

The IP address assigned to a session is determined during session initialization and remains fixed for the lifetime of the session. Using a different geolocation in a follow-up request that reuses a session is not supported and results in undefined behavior.

Checking sessions

Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.

Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches, A-B tests, etc. It can also help in cases where website sessions expire before Zyte API sessions.

By default, if the session initialization parameters include actions, and any of them has a returned status in the response (meaning it failed and stopped execution), the session is discarded. Actions with onError set to "continue" that fail produce a continued status instead, and do not cause the session to be discarded. You can disable this behavior by setting ZYTE_API_SESSION_INIT_ACTION_FAILURE_INVALIDATES_SESSION to False.

In addition, if a location is defined through zyte_api_session_location, ZYTE_API_SESSION_LOCATION or location(), and the setLocation action is not available for a given website, the spider is closed with unsupported_set_location as the close reason. In that case, you should define proper session initialization logic for requests targeting that website.

For sessions initialized without actions, no action-based session check is performed.
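The default action-based check can be pictured with the following sketch. It is a simplified model, not the plugin's implementation: init_actions_ok is a hypothetical name, and it assumes that the raw Zyte API response lists action outcomes under an "actions" key, each with a "status" field.

```python
def init_actions_ok(api_response: dict) -> bool:
    """Return False if any action has a "returned" status, True otherwise.

    A response without actions passes, matching the rule that sessions
    initialized without actions get no action-based check.
    """
    actions = api_response.get("actions", [])
    return all(action.get("status") != "returned" for action in actions)


failed = {"actions": [{"action": "setLocation", "status": "returned"}]}
tolerated = {"actions": [{"action": "click", "status": "continued"}]}
no_actions = {"browserHtml": "<html></html>"}
```

With these sample responses, only failed would cause the session to be discarded.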

To implement your own code to check session responses and determine whether their session should be kept or discarded, use the ZYTE_API_SESSION_CHECKER setting. If you need to check session validity for multiple websites, it is better to define a separate session config override for each website, each with its own implementation of check().

The zyte_api_session_location and zyte_api_session_params request metadata keys, if present in a request that triggers a session initialization request, will be copied into the session initialization request, so that they are available when ZYTE_API_SESSION_CHECKER or check() are called for a session initialization request.
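A custom checker might look like the sketch below. It assumes that ZYTE_API_SESSION_CHECKER accepts a class (or its import path) exposing a check(response, request) method with the same contract as SessionConfig.check; the class name, the module path and the "Log out" marker are all hypothetical.

```python
from types import SimpleNamespace


class LoggedInSessionChecker:
    """Hypothetical checker: keep a session only while the page shows "Log out"."""

    def check(self, response, request) -> bool:
        # True keeps the session; False discards it, and the request is
        # retried with a different session.
        return "Log out" in response.text


# settings.py (the module path is illustrative):
# ZYTE_API_SESSION_CHECKER = "myproject.sessions.LoggedInSessionChecker"

# Stand-in objects to exercise the checker outside of a crawl.
valid = SimpleNamespace(text="<a href='/logout'>Log out</a>")
expired = SimpleNamespace(text="<a href='/login'>Log in</a>")
checker = LoggedInSessionChecker()
```

Because this checker reads response.text, it only works when the response actually carries a body, which is the concern discussed next.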

If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using Zyte API automatic extraction, e.g. when using Zyte spider templates. For example, you can use ZYTE_API_AUTOMAP_PARAMS and ZYTE_API_PROVIDER_PARAMS to force browserHtml or httpResponseBody to be set on every Zyte API request:

settings.py
ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}

Managing pools

scrapy-zyte-api can maintain multiple session pools.

By default, scrapy-zyte-api maintains a separate pool of sessions per domain.

If you use the zyte_api_session_params or zyte_api_session_location request metadata keys, scrapy-zyte-api will automatically use separate session pools within the target domain for those parameters or locations. See pool() for details.

If you want to customize further which pool is assigned to a given request, e.g. to have the same pool for multiple domains or use different pools within the same domain (e.g. for different URL patterns), you can either use the zyte_api_session_pool request metadata key or use the pool() method of session config overrides.

The ZYTE_API_SESSION_POOL_SIZE setting determines the desired number of concurrent, active, working sessions per pool. The ZYTE_API_SESSION_POOLS setting allows defining different values for specific pools.
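A possible configuration combining these options is sketched below. The shape of the ZYTE_API_SESSION_POOLS values is an assumption (a dict per pool ID using the "size" and "delay" keys that pool() may return), and the pool ID and numbers are illustrative.

```python
# settings.py
# Desired number of concurrent, active, working sessions per pool.
ZYTE_API_SESSION_POOL_SIZE = 8

# Per-pool overrides, keyed by pool ID. Assumption: each value is a dict
# using the same keys that pool() may return ("size", "delay").
ZYTE_API_SESSION_POOLS = {
    "ecommerce.example": {"size": 16, "delay": 2.0},
}

# Request metadata that routes a request to a specific pool; pass it as
# meta= when building the request.
shared_pool_meta = {"zyte_api_session_pool": "ecommerce.example"}
```

Requests for different domains that carry the same zyte_api_session_pool value would then share the sessions of that pool.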

The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:

The session pool assigned to a request affects the fingerprint of the request. 2 requests with a different session pool ID are considered different requests, i.e. not duplicate requests, even if they are otherwise identical.

Optimizing sessions

For faster crawls and lower costs, especially where session initialization requests are more expensive than session usage requests (e.g. scenarios where initialization relies on browserHtml while usage relies on httpResponseBody), you should try to make your sessions live as long as possible before they are discarded.

Here are some things you can try:

If you do not need session checking and your initialization parameters are only browserHtml and actions, Zyte-managed sessions might be a more cost-effective choice, as they live much longer than user-managed sessions.

Overriding session configs

For spiders that target a single website, using settings and request metadata keys for session initialization and session checking should do the job. However, for broad-crawl or multi-website spiders, when you need to modify session-using requests based on session initialization responses, or for code reusability, you might want to define different session configs for different websites.

The default session config is implemented by the SessionConfig class:

class scrapy_zyte_api.SessionConfig(crawler)

Default session configuration for scrapy-zyte-api sessions.

ADDRESS_FIELDS: list[str] = ['addressCountry', 'addressRegion', 'postalCode', 'streetAddress']

List of address fields to use when available, and their order, when creating a pool ID for a request based on the content of the zyte_api_session_location metadata key. See pool().

check(response: Response, request: Request) -> bool

Return True if the session used to fetch response should be kept, return False if it should be discarded, or raise CloseSpider if the spider should be closed.

The default implementation checks the outcome of the setLocation action if a location was defined, and also discards sessions where any action in the initialization response has a returned status (i.e. failed and stopped execution). Both behaviors are described in Checking sessions.

If you need to tell whether request is a session initialization request or not, use is_session_init_request().

enabled(request: Request) -> bool

Return True if the request should use sessions from session management or False otherwise.

The default implementation is based on settings and request metadata keys as described in Enabling session management.

location(request: Request) -> dict[str, str]

Return the address dict to use for location-based session initialization for request.

The default implementation is based on settings and request metadata keys as described in Initializing sessions.

When overriding this method, you should only return a custom value if the default implementation returns an empty dict, e.g.

def location(self, request: Request) -> Dict[str, str]:
    fallback = {"addressCountry": "US", "addressRegion": "NY", "postalCode": "10001"}
    return super().location(request) or fallback

Note

An implementation of location() can technically override zyte_api_session_location or ZYTE_API_SESSION_LOCATION, but it is not recommended as it breaks the precedence chain that users expect.

You should only override this method if you need a location to be used even when no location is specified through request metadata or settings. It can be especially useful if you can determine the right location to use based on the request, e.g.

def location(self, request: Request) -> Dict[str, str]:
    fallback = {}
    if postal_code := w3lib.url.url_query_parameter(request.url, "postalCode"):
        fallback["postalCode"] = postal_code
    return super().location(request) or fallback

Same as with zyte_api_session_location and ZYTE_API_SESSION_LOCATION, the returned location fields should match those of the address parameter of the setLocation action where possible, even when using an implementation of params() that does not rely on the setLocation action.

params(request: Request) -> dict[str, Any] | Awaitable[dict[str, Any]]

Return the Zyte API request parameters to use to initialize a session for request.

The default implementation is based on settings and request metadata keys as described in Initializing sessions.

When overriding this method, you should return parameters for the target location, i.e. the output of location(), unless that output is an empty dict, e.g.

def params(self, request: Request) -> Dict[str, Any]:
    if location := self.location(request):
        return {
            "url": "https://example.com/new-session/for-country",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
            "httpRequestText": location["addressCountry"],
        }
    return {
        "url": "https://example.com/new-session",
        "httpResponseBody": True,
    }

The returned parameters do not need to include url. If missing, it is picked from the request triggering a session initialization request.

This method can be implemented as a coroutine function. For example:

async def params(self, request: Request) -> Dict[str, Any]:
    bootstrap_request = Request(
        "https://example.com/api/get-session",
        meta={
            "zyte_api_session_enabled": False,
            "zyte_api": {
                "httpResponseBody": True,
                "responseCookies": True,
            },
        },
    )
    response = await self.crawler.engine.download_async(bootstrap_request)
    return {
        "url": "https://example.com/new-session",
        "httpResponseBody": True,
        "requestCookies": response.raw_api_response["responseCookies"],
    }

pool(request: Request) -> str | PoolConfig

Return the ID of the session pool to use for request, or a dict with additional session pool config.

The main aspects of the default implementation are described in Managing pools.

When the zyte_api_session_params request metadata key is used, the pool ID is the target domain followed by an integer between brackets (e.g. example.com[0]), and a log message indicates which session initialization parameters are associated with that pool ID.

When the zyte_api_session_location request metadata key is used, the pool ID is the target domain followed by an at sign and the comma-separated values of the non-empty fields from ADDRESS_FIELDS (e.g. example.com@US,NY,10001).
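The location-based pool ID format can be reproduced with a few lines of code. This is an illustrative reimplementation of the format described above, not the plugin's own code; location_pool_id is a hypothetical name.

```python
# Field order as documented for SessionConfig.ADDRESS_FIELDS.
ADDRESS_FIELDS = ["addressCountry", "addressRegion", "postalCode", "streetAddress"]


def location_pool_id(domain: str, location: dict) -> str:
    """Build a pool ID such as "example.com@US,NY,10001" from a location dict."""
    # Keep only non-empty fields, in the order given by ADDRESS_FIELDS.
    values = [location[field] for field in ADDRESS_FIELDS if location.get(field)]
    return f"{domain}@{','.join(values)}"


pool_id = location_pool_id(
    "example.com",
    {"postalCode": "10001", "addressCountry": "US", "addressRegion": "NY"},
)
```

The order of the keys in the location dict does not matter; only the order of ADDRESS_FIELDS determines the resulting ID.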

Instead of a string, this method can also return a dict containing the pool ID under the id key, and optionally any other key supported by ZYTE_API_SESSION_POOLS. For example:

def pool(self, request):
    if "ecommerce.example" in urlparse_cached(request).netloc:
        return {
            "id": "ecommerce.example",
            "delay": 2.0,
            "size": 16,
        }
    return super().pool(request)

The values of optional keys take precedence over the corresponding pool-independent settings, e.g. delay takes precedence over ZYTE_API_SESSION_DELAY for the corresponding pool ID, but do not override those defined in ZYTE_API_SESSION_POOLS.

For any given pool ID, the values of optional keys are only taken into account when the pool ID is first encountered. You cannot use this method to change them at run time.

process_request(request: Request) -> Request | None

Process request after it has been assigned a session.

Return None to send the request as is, or return a new request object to replace the original request.

The default implementation does not modify the request.

You can combine this method and check() to modify requests based on session initialization responses. For example:

  1. In __init__(), create a dictionary to store session data:

    def __init__(self, crawler):
        super().__init__(crawler)
        self.session_data = {}
    
  2. In check(), store data from the session initialization response in session_data:

    def check(self, response: Response, request: Request) -> bool:
        if scrapy_zyte_api.is_session_init_request(request):
            session_id = scrapy_zyte_api.get_request_session_id(request)
            self.session_data[session_id] = {
                "csrf_token": response.css(".csrf-token::text").get(),
            }
        return super().check(response, request)
    
  3. In process_request(), read the session data and act accordingly, either modifying the request in place where possible, e.g.:

    def process_request(self, request: Request) -> Optional[Request]:
        session_id = scrapy_zyte_api.get_request_session_id(request)
        csrf_token = self.session_data[session_id]["csrf_token"]
        request.headers["CSRF-Token"] = csrf_token
    

    Or returning an entirely new request, e.g.:

    def process_request(self, request: Request) -> Optional[Request]:
        session_id = scrapy_zyte_api.get_request_session_id(request)
        csrf_token = self.session_data[session_id]["csrf_token"]
        new_url = w3lib.url.add_or_replace_parameter(request.url, "csrf_token", csrf_token)
        return request.replace(url=new_url)
    

To define a different session config for a given URL pattern, install web-poet and define a subclass of SessionConfig decorated with session_config():

scrapy_zyte_api.session_config(include: str | Iterable[str], *, instead_of: type[SessionConfig] | None = SessionConfig, exclude: str | Iterable[str] | None = None, priority: int = 500, **kwargs)

If you only need to override the SessionConfig.check or SessionConfig.params methods for scenarios involving a location, you may subclass LocationSessionConfig instead:

class scrapy_zyte_api.LocationSessionConfig(crawler)

SessionConfig subclass to minimize boilerplate when implementing location-specific session configs, i.e. session configs where the default values should be used unless a location is set.

Provides counterparts to some SessionConfig methods that are only called when a location is set, and get that location as a parameter.

location_check(response: Response, request: Request, location: dict[str, Any]) -> bool

Like SessionConfig.check, but it is only called when a location is set, and gets that location as a parameter.

location_params(request: Request, location: dict[str, Any]) -> dict[str, Any] | Awaitable[dict[str, Any]]

Like SessionConfig.params, but it is only called when a location is set, and gets that location as a parameter.

If in a session config implementation or in any other Scrapy component you need to tell whether a request is a session initialization request or not, use is_session_init_request():

scrapy_zyte_api.is_session_init_request(request)

Return True if the request is a session initialization request or False otherwise.

To get the session ID of a given request, use:

scrapy_zyte_api.get_request_session_id(request: Request) -> str | None

Return the session ID of request, or None if it does not have a session ID assigned.

Classes decorated with session_config() are registered into session_config_registry:

scrapy_zyte_api.session_config_registry

Instance of web_poet.rules.RulesRegistry that holds session configs.

Session retry policies

The following retry policies are designed to work well with session management (see Enabling session management). They are meant for ZYTE_API_SESSION_RETRY_POLICY:

scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY

Alternative to the default retry policy for session management that does not retry 520 and 521 responses.

scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY

Alternative to the aggressive retry policy for session management that does not retry 520 and 521 responses.

Note

When using python-zyte-api 0.5.2 or lower, this is the same as SESSION_DEFAULT_RETRY_POLICY.

Spider closers

Session management can close your spider early in the following scenarios:

A custom SessionConfig.check implementation may also close your spider with a custom reason by raising a CloseSpider exception.

Troubleshooting

RuntimeError: Could not get a session ID

This exception means that, after a given number of attempts with a given minimum wait time between attempts, it was not possible to get a session ID from the session rotation queue. If you see it, consider the following possibilities:

  • A bug in your session validation code may be causing it to return False for a valid response.

    This is especially likely if you see this issue for very few, specific requests, while most requests work fine.

  • The values of the ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS and ZYTE_API_SESSION_QUEUE_WAIT_TIME settings may be too low for your scenario, in which case you can modify them accordingly.
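If the second possibility applies, raising both settings could look like the following sketch. The values are illustrative only, since the defaults are not stated on this page, and the wait time is assumed to be expressed in seconds.

```python
# settings.py
# Illustrative values only; tune them to your crawl.
ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS = 120  # attempts to get a session ID
ZYTE_API_SESSION_QUEUE_WAIT_TIME = 2.0  # minimum wait between attempts (assumed seconds)
```

Higher values make the crawl more tolerant of slow session initialization, at the cost of waiting longer before the error surfaces when sessions cannot be created at all.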