Session management

Zyte API provides two powerful session APIs: client-managed sessions and server-managed sessions.

When using scrapy-zyte-api, you can use these session APIs through the corresponding Zyte API fields (session, sessionContext).

However, scrapy-zyte-api also provides its own session management API, similar to that of server-managed sessions, but built on top of client-managed sessions.

scrapy-zyte-api session management offers some advantages over server-managed sessions:

However, scrapy-zyte-api session management is not a replacement for server-managed sessions or client-managed sessions:

  • Server-managed sessions offer a longer lifetime than the client-managed sessions that scrapy-zyte-api session management uses, so as long as you do not need one of the scrapy-zyte-api session management features, server-managed sessions can be significantly more efficient (fewer total sessions needed per crawl).

    Zyte API can also optimize server-managed sessions based on the target website. With scrapy-zyte-api session management, you need to handle optimization yourself.

  • Client-managed sessions offer full control over session management, while scrapy-zyte-api session management removes some of that control to provide an easier API for supported use cases.

Enabling session management

To enable session management for all requests, set ZYTE_API_SESSION_ENABLED to True. You can also toggle session management on or off for specific requests using the zyte_api_session_enabled request metadata key, or override the enabled() method of a session config override.
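As a rough sketch of the two switches described above, the snippet below enables session management project-wide and builds a request metadata dict that opts a single request out. The name opt_out_meta is hypothetical; in a spider, the dict would be passed as the meta argument of a Scrapy request.

```python
# settings.py: enable session management for all requests.
ZYTE_API_SESSION_ENABLED = True

# Request metadata that turns session management off for one request.
# In a spider this dict would be passed as the ``meta`` argument of a
# request; the name ``opt_out_meta`` is only illustrative.
opt_out_meta = {"zyte_api_session_enabled": False}
```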

By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a browser request targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and session config overrides.

For session management to work as expected, your ZYTE_API_RETRY_POLICY should not retry 520 and 521 responses:
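One way to meet this requirement is to use one of the policies listed under Session retry policies. The sketch below assumes that ZYTE_API_RETRY_POLICY accepts the import path of a retry policy as a string:

```python
# settings.py
# Assumption: the setting accepts the import path of a retry policy object.
ZYTE_API_RETRY_POLICY = "scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY"
```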

Initializing sessions

To change the default session initialization parameters, you have the following options:

Precedence, from higher to lower, is:

  1. zyte_api_session_params

  2. zyte_api_session_location

  3. ZYTE_API_SESSION_PARAMS

  4. ZYTE_API_SESSION_LOCATION

  5. location()

  6. params()
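To make the ordering concrete, here is a small plain-Python sketch that picks the winning source from the list above. It only illustrates the documented precedence and is not the library's implementation; the names PRECEDENCE and winning_source are hypothetical.

```python
from typing import Iterable, Optional

# Sources of session initialization parameters, highest precedence first,
# in the same order as the list above.
PRECEDENCE = [
    "zyte_api_session_params",    # request metadata key
    "zyte_api_session_location",  # request metadata key
    "ZYTE_API_SESSION_PARAMS",    # setting
    "ZYTE_API_SESSION_LOCATION",  # setting
    "location()",                 # session config method
    "params()",                   # session config method
]


def winning_source(defined: Iterable[str]) -> Optional[str]:
    """Return the highest-precedence source among those that are defined."""
    defined_set = set(defined)
    for source in PRECEDENCE:
        if source in defined_set:
            return source
    return None


# A request-level location beats a project-level params setting.
print(winning_source(["ZYTE_API_SESSION_PARAMS", "zyte_api_session_location"]))
# prints zyte_api_session_location
```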

Checking sessions

Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.

Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches, A-B tests, etc. It can also help in cases where website sessions expire before Zyte API sessions.

By default, if a location is defined through zyte_api_session_location, ZYTE_API_SESSION_LOCATION or location(), the outcome of the first setLocation action used, if any, is checked. This happens even if the parameters used for session initialization actually come from zyte_api_session_params or ZYTE_API_SESSION_PARAMS. If the action fails, the session is discarded. If the action is not even available for a given website, the spider is closed with unsupported_set_location as the close reason; in that case, you should define a proper session initialization logic for requests targeting that website.

For sessions initialized without a configured location, no session check is performed; sessions are assumed to be fine until they expire or are banned. That is so even if session initialization parameters include a setLocation action.

To implement your own code to check session responses and determine whether their session should be kept or discarded, use the ZYTE_API_SESSION_CHECKER setting. If you need to check session validity for multiple websites, it is better to define a separate session config override for each website, each with its own implementation of check().
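A minimal sketch of such a checker follows. It assumes that the object configured in ZYTE_API_SESSION_CHECKER exposes a check(response, request) method with the same contract as SessionConfig.check (True keeps the session, False discards it), and that the setting accepts the class itself. The class name and the .logged-in selector are hypothetical.

```python
class LoggedInSessionChecker:
    """Keep a session only while the page still shows a logged-in marker."""

    def check(self, response, request) -> bool:
        # ``response`` is expected to be a Scrapy response with a body, so
        # that CSS selectors work. ".logged-in" is a made-up selector.
        return bool(response.css(".logged-in"))


# settings.py
# Assumption: the setting accepts the checker class (or its import path).
ZYTE_API_SESSION_CHECKER = LoggedInSessionChecker
```

Because the check depends on the response body, it only works when the responses going through session management actually carry one.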

The zyte_api_session_location and zyte_api_session_params request metadata keys, if present in a request that triggers a session initialization request, will be copied into the session initialization request, so that they are available when ZYTE_API_SESSION_CHECKER or check() are called for a session initialization request.

If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using Zyte API automatic extraction, e.g. when using Zyte spider templates. For example, you can use ZYTE_API_AUTOMAP_PARAMS and ZYTE_API_PROVIDER_PARAMS to force browserHtml or httpResponseBody to be set on every Zyte API request:

settings.py
ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}

Managing pools

scrapy-zyte-api can maintain multiple session pools.

By default, scrapy-zyte-api maintains a separate pool of sessions per domain.

If you use the zyte_api_session_params or zyte_api_session_location request metadata keys, scrapy-zyte-api will automatically use separate session pools within the target domain for those parameters or locations. See pool() for details.

If you want to customize further which pool is assigned to a given request, e.g. to have the same pool for multiple domains or use different pools within the same domain (e.g. for different URL patterns), you can either use the zyte_api_session_pool request metadata key or use the pool() method of session config overrides.

The ZYTE_API_SESSION_POOL_SIZE setting determines the desired number of concurrent, active, working sessions per pool. The ZYTE_API_SESSION_POOL_SIZES setting allows defining different values for specific pools.
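For illustration, the settings and metadata below route requests to a shared pool and give that pool a larger size. The pool ID and the sizes are made up, and the snippet assumes that ZYTE_API_SESSION_POOL_SIZES is a dict mapping pool IDs to sizes; check the settings reference before relying on that format.

```python
# settings.py
ZYTE_API_SESSION_POOL_SIZE = 8  # desired number of sessions per pool
# Assumed format: pool ID -> desired size for that specific pool.
ZYTE_API_SESSION_POOL_SIZES = {"ecommerce.example": 16}

# Request metadata that assigns a request to the shared pool. In a spider
# this dict would be passed as the ``meta`` argument of a request.
shared_pool_meta = {"zyte_api_session_pool": "ecommerce.example"}
```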

The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:

Optimizing sessions

For faster crawls and lower costs, especially where session initialization requests are more expensive than session usage requests (e.g. scenarios where initialization relies on browserHtml while usage relies on httpResponseBody), you should try to make your sessions live as long as possible before they are discarded.

Here are some things you can try:

If you do not need session checking and your initialization parameters are only browserHtml and actions, server-managed sessions might be a more cost-effective choice, as they live much longer than client-managed sessions.

Overriding session configs

For spiders that target a single website, using settings and request metadata keys for session initialization and session checking should do the job. However, for broad-crawl spiders, multi-website spiders, or for code reusability purposes, you might want to define different session configs for different websites.

The default session config is implemented by the SessionConfig class:

class scrapy_zyte_api.SessionConfig(crawler)

Default session configuration for scrapy-zyte-api sessions.

ADDRESS_FIELDS: List[str] = ['addressCountry', 'addressRegion', 'postalCode', 'streetAddress']

List of address fields to use when available, and their order, when creating a pool ID for a request based on the content of the zyte_api_session_location metadata key. See pool().

check(response: Response, request: Request) -> bool

Return True if the session used to fetch response should be kept, return False if it should be discarded, or raise CloseSpider if the spider should be closed.

The default implementation checks the outcome of the setLocation action if a location was defined, as described in Checking sessions.

If you need to tell whether request is a session initialization request or not, use is_session_init_request().

enabled(request: Request) -> bool

Return True if the request should use sessions from session management or False otherwise.

The default implementation is based on settings and request metadata keys as described in Enabling session management.

location(request: Request) -> Dict[str, str]

Return the address dict to use for location-based session initialization for request.

The default implementation is based on settings and request metadata keys as described in Initializing sessions.

When overriding this method, you should only return a custom value if the default implementation returns an empty dict, e.g.

def location(self, request: Request) -> Dict[str, str]:
    fallback = {"addressCountry": "US", "addressRegion": "NY", "postalCode": "10001"}
    return super().location(request) or fallback

Note

An implementation of location() can technically override zyte_api_session_location or ZYTE_API_SESSION_LOCATION, but it is not recommended as it breaks the precedence chain that users expect.

You should only override this method if you need a location to be used even when no location is specified through request metadata or settings. It can be especially useful if you can determine the right location to use based on the request, e.g.

def location(self, request: Request) -> Dict[str, str]:
    fallback = {}
    if postal_code := w3lib.url.url_query_parameter(request.url, "postalCode"):
        fallback["postalCode"] = postal_code
    return super().location(request) or fallback

As with zyte_api_session_location and ZYTE_API_SESSION_LOCATION, the returned location fields should match those of the address parameter of the setLocation action where possible, even when using an implementation of params() that does not rely on the setLocation action.

params(request: Request) -> Dict[str, Any]

Return the Zyte API request parameters to use to initialize a session for request.

The default implementation is based on settings and request metadata keys as described in Initializing sessions.

When overriding this method, you should return parameters for the target location, i.e. the output of location(), unless that output is an empty dict, e.g.

def params(self, request: Request) -> Dict[str, Any]:
    if location := self.location(request):
        return {
            "url": "https://example.com/new-session/for-country",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
            "httpRequestText": location["addressCountry"],
        }
    return {
        "url": "https://example.com/new-session",
        "httpResponseBody": True,
    }

The returned parameters do not need to include url. If missing, it is picked from the request triggering a session initialization request.

pool(request: Request) -> str

Return the ID of the session pool to use for request.

The main aspects of the default implementation are described in Managing pools.

When the zyte_api_session_params request metadata key is used, the pool ID is the target domain followed by an integer between brackets (e.g. example.com[0]), and a log message indicates which session initialization parameters are associated with that pool ID.

When the zyte_api_session_location request metadata key is used, the pool ID is the target domain followed by an at sign and the comma-separated values of the non-empty fields from ADDRESS_FIELDS (e.g. example.com@US,NY,10001).
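The location-based pool ID format can be approximated with a few lines of plain Python. This is a sketch of the naming scheme described above, not the library's actual code; the function name location_pool_id is hypothetical.

```python
from typing import Dict

# Same fields and order as SessionConfig.ADDRESS_FIELDS above.
ADDRESS_FIELDS = ["addressCountry", "addressRegion", "postalCode", "streetAddress"]


def location_pool_id(domain: str, location: Dict[str, str]) -> str:
    """Build '<domain>@<values>' from the non-empty address fields."""
    values = [location[field] for field in ADDRESS_FIELDS if location.get(field)]
    return f"{domain}@{','.join(values)}"


# The order of ADDRESS_FIELDS decides the order of the values, regardless of
# the order of the keys in the location dict.
print(
    location_pool_id(
        "example.com",
        {"postalCode": "10001", "addressCountry": "US", "addressRegion": "NY"},
    )
)
# prints example.com@US,NY,10001
```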

To define a different session config for a given URL pattern, install web-poet and define a subclass of SessionConfig decorated with session_config():

scrapy_zyte_api.session_config(include, *, instead_of: Type | None = SessionConfig, exclude=None, priority: int = 500, **kwargs)

Mark the decorated SessionConfig subclass as the session config to use for the specified URL patterns.

Usage example:

from typing import Any, Dict

from scrapy import Request
from scrapy.http.response import Response
from scrapy_zyte_api import SessionConfig, session_config


@session_config(["ecommerce.de.example", "ecommerce.us.example"])
class EcommerceExampleSessionConfig(SessionConfig):

    def pool(self, request: Request) -> str:
        return "ecommerce.example"

    def params(self, request: Request) -> Dict[str, Any]:
        if location := self.location(request):
            return {
                "url": request.url,
                "browserHtml": True,
                "actions": [
                    {
                        "action": "type",
                        "selector": {"type": "css", "value": ".zipcode"},
                        "text": location["postalCode"],
                    },
                    {
                        "action": "click",
                        "selector": {"type": "css", "value": "[type='submit']"},
                    },
                ],
            }
        return super().params(request)

    def check(self, response: Response, request: Request) -> bool:
        if location := self.location(request):
            return response.css(".zipcode::text").get() == location["postalCode"]
        return super().check(response, request)

Your SessionConfig subclass must be defined in a module that gets imported at run time. See SCRAPY_POET_DISCOVER in the scrapy-poet setting reference.

The parameters of this decorator are those of web_poet.handle_urls(), except that instead_of is SessionConfig by default, to_return is not supported, and session configs are registered in their own rule registry.

If you only need to override the SessionConfig.check or SessionConfig.params methods for scenarios involving a location, you may subclass LocationSessionConfig instead:

class scrapy_zyte_api.LocationSessionConfig(crawler)

SessionConfig subclass to minimize boilerplate when implementing location-specific session configs, i.e. session configs where the default values should be used unless a location is set.

Provides counterparts to some SessionConfig methods that are only called when a location is set, and get that location as a parameter.

location_check(response: Response, request: Request, location: Dict[str, Any]) -> bool

Like SessionConfig.check, but it is only called when a location is set, and gets that location as a parameter.

location_params(request: Request, location: Dict[str, Any]) -> Dict[str, Any]

Like SessionConfig.params, but it is only called when a location is set, and gets that location as a parameter.

If in a session config implementation or in any other Scrapy component you need to tell whether a request is a session initialization request or not, use is_session_init_request():

scrapy_zyte_api.is_session_init_request(request)

Return True if the request is a session initialization request or False otherwise.

Classes decorated with session_config() are registered into session_config_registry:

scrapy_zyte_api.session_config_registry

Instance of web_poet.rules.RulesRegistry that holds session configs.

Session retry policies

The following retry policies are designed to work well with session management (see Enabling session management):

scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY

Alternative to the default retry policy for session management that does not retry 520 and 521 responses.

scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY

Alternative to the aggressive retry policy for session management that does not retry 520 and 521 responses.

Note

When using python-zyte-api 0.5.2 or lower, this is the same as SESSION_DEFAULT_RETRY_POLICY.

Spider closers

Session management can close your spider early in the following scenarios:

A custom SessionConfig.check implementation may also close your spider with a custom reason by raising a CloseSpider exception.

Session stats

The following stats exist for scrapy-zyte-api session management:

scrapy-zyte-api/sessions/pools/{pool}/init/check-error

Number of times that a session for pool {pool} triggered an unexpected exception during its session validation check right after initialization.

It is most likely the result of a bad implementation of SessionConfig.check; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/init/check-failed

Number of times that a session from pool {pool} failed its session validation check right after initialization.

scrapy-zyte-api/sessions/pools/{pool}/init/check-passed

Number of times that a session from pool {pool} passed its session validation check right after initialization.

scrapy-zyte-api/sessions/pools/{pool}/init/failed

Number of times that initializing a session for pool {pool} resulted in an unsuccessful response.

scrapy-zyte-api/sessions/pools/{pool}/init/param-error

Number of times that initializing a session for pool {pool} triggered an unexpected exception when obtaining the Zyte API parameters for session initialization.

It is most likely the result of a bad implementation of SessionConfig.params; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/use/check-error

Number of times that a response that used a session from pool {pool} triggered an unexpected exception during its session validation check.

It is most likely the result of a bad implementation of SessionConfig.check; the logs should contain an error message with a traceback for such errors.

scrapy-zyte-api/sessions/pools/{pool}/use/check-failed

Number of times that a response that used a session from pool {pool} failed its session validation check.

scrapy-zyte-api/sessions/pools/{pool}/use/check-passed

Number of times that a response that used a session from pool {pool} passed its session validation check.

scrapy-zyte-api/sessions/pools/{pool}/use/expired

Number of times that a session from pool {pool} expired.

scrapy-zyte-api/sessions/pools/{pool}/use/failed

Number of times that a request that used a session from pool {pool} got an unsuccessful response.

scrapy-zyte-api/sessions/use/disabled

Number of processed requests for which session management was disabled.
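These stats can be combined to judge how well session initialization works for a pool. The helper below is a sketch that computes the share of initialized sessions that passed their validation check. The function name and the sample numbers are made up; in a real crawl, the stats dict would typically come from the Scrapy stats collector.

```python
from typing import Dict, Optional

PREFIX = "scrapy-zyte-api/sessions/pools"


def init_check_pass_rate(stats: Dict[str, int], pool: str) -> Optional[float]:
    """Return passed / (passed + failed) for init checks, or None if no data."""
    passed = stats.get(f"{PREFIX}/{pool}/init/check-passed", 0)
    failed = stats.get(f"{PREFIX}/{pool}/init/check-failed", 0)
    total = passed + failed
    return passed / total if total else None


# Illustrative numbers only, not taken from a real crawl.
sample_stats = {
    f"{PREFIX}/example.com/init/check-passed": 3,
    f"{PREFIX}/example.com/init/check-failed": 1,
}
print(init_check_pass_rate(sample_stats, "example.com"))  # prints 0.75
```

A low rate suggests that the session initialization parameters or the check itself need attention for that pool.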