Plugin-managed sessions
Note
This page covers plugin-managed sessions, a session management feature built into scrapy-zyte-api. It does not cover the 2 session management features provided natively by Zyte API:
User-managed sessions, which give you full control over session management via the session field.
Zyte-managed sessions, which let Zyte API handle session management for you via the sessionContext field.
You can use both of those Zyte API features directly from scrapy-zyte-api through their corresponding request parameters.
Plugin-managed sessions have an API similar to that of Zyte-managed sessions, but are built on top of user-managed sessions.
Plugin-managed sessions offer some advantages over Zyte-managed sessions:
You can perform session validity checks, so that the sessions of responses that do not pass those checks are refreshed, and the responses retried with a different session.
You can use arbitrary Zyte API parameters for session initialization, beyond those that sessionContextParameters supports.
You have granular control over the session pool size, max errors, etc. See Optimizing sessions and Overriding session configs.
However, plugin-managed sessions are not a replacement for Zyte-managed sessions or user-managed sessions:
Zyte-managed sessions offer a longer lifetime than the user-managed sessions that plugin-managed sessions use. Unless you need one of the features of plugin-managed sessions, Zyte-managed sessions can be significantly more efficient, because they need fewer session-initialization requests per crawl.
Zyte API can also optimize Zyte-managed sessions based on the target website. With plugin-managed sessions, you need to handle optimization yourself.
User-managed sessions offer full control over session management, while plugin-managed sessions remove some of that control to provide an easier API for supported use cases.
Enabling session management
To enable session management for all requests, set
ZYTE_API_SESSION_ENABLED to True. You can also toggle session
management on or off for specific requests using the
zyte_api_session_enabled request metadata key, or override the
enabled() method of a session config
override.
By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a browser request targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and session config overrides.
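As a minimal sketch, the global setting and the per-request toggle described above could look as follows. The `session_meta` helper is hypothetical and only exists for illustration; the dictionary it returns is what you would pass as `meta` when creating a Scrapy request.

```python
# settings.py: enable plugin-managed sessions for all requests.
ZYTE_API_SESSION_ENABLED = True


def session_meta(enabled: bool) -> dict:
    """Build request metadata that toggles session management for one request.

    Hypothetical helper: pass the result as ``meta`` when creating a request.
    """
    return {"zyte_api_session_enabled": enabled}


# Opt a single request out of session management.
no_session_meta = session_meta(False)
```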
For session management to work as expected, session requests must use a retry policy that does not retry 520 and 521 responses, so that the session management middleware can handle those instead.
520 and 521 are Zyte API status codes for download errors (e.g. connection
refused). When session management receives a 520 or 521 response, it counts it
as a session error, potentially discards the session (see
ZYTE_API_SESSION_MAX_ERRORS), and retries the request with a
different session. If the retry policy also retried 520 and 521 responses, it
would do so before the session middleware can swap the session, potentially
reusing the same problematic session for the retry.
scrapy-zyte-api handles this automatically: all requests that are assigned a
session get their zyte_api_retry_policy request metadata key set
(via setdefault()) to the value of
ZYTE_API_SESSION_RETRY_POLICY.
Non-session requests continue to use ZYTE_API_RETRY_POLICY as usual,
unaffected by session management.
To override the retry policy for a specific request only, set
zyte_api_retry_policy in the request metadata before the request
reaches the session middleware. The setdefault() call will not
override an already-set value.
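The effect of the setdefault() call can be illustrated with plain dictionaries. This is a simplified sketch, not the middleware code: the function name is made up, retry policies are represented by import-path strings as an assumption, and `myproject.retries.CUSTOM_POLICY` is a hypothetical policy.

```python
SESSION_POLICY = "scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY"


def apply_session_retry_policy(meta: dict, session_policy: str) -> dict:
    """Mimic the setdefault() step: fill in the key only when it is missing."""
    meta.setdefault("zyte_api_retry_policy", session_policy)
    return meta


# A request without an explicit policy gets the session retry policy.
default_meta = apply_session_retry_policy({}, SESSION_POLICY)

# A request that already carries a policy keeps it.
custom_meta = apply_session_retry_policy(
    {"zyte_api_retry_policy": "myproject.retries.CUSTOM_POLICY"},
    SESSION_POLICY,
)
```

If you set a custom policy this way, keep in mind that it should still leave 520 and 521 responses to the session middleware.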
Initializing sessions
To change the default session initialization parameters, you have the following options:
To initialize sessions with a given location, use the ZYTE_API_SESSION_LOCATION setting or the zyte_api_session_location request metadata key.
The value should be a dictionary with keys supported by the address field of the setLocation action, e.g.
    {
        "addressCountry": "US",
        "addressRegion": "NY",
        "postalCode": "10001",
        "streetAddress": "3 Penn Plz",
    }
By default, the location is set using the setLocation action. A session config override can change that through params().
For session initialization with arbitrary Zyte API request fields, use the ZYTE_API_SESSION_PARAMS setting or the zyte_api_session_params request metadata key.
It works similarly to sessionContextParameters from Zyte-managed sessions, but it supports arbitrary Zyte API parameters instead of a specific subset.
If it does not define a "url", the URL of the request triggering a session initialization request will be used.
When defining a session config override, you can customize the default and location-setting session initialization parameters through params(). location() can define a default location for its session config override to use when no location is specified otherwise.
Precedence, from higher to lower, is:
Note
The IP address assigned to a session is determined during session initialization and remains fixed for the lifetime of the session. Using a different geolocation in a follow-up request that reuses a session is not supported and results in undefined behavior.
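The following sketch shows what the location setting and the per-request parameters could look like, together with the URL fallback rule for session initialization parameters. The `session_init_params` function is a hypothetical illustration of that rule, not the plugin's code, and the example URLs are placeholders.

```python
# settings.py: initialize every session with this location.
ZYTE_API_SESSION_LOCATION = {"addressCountry": "US", "postalCode": "10001"}

# Request metadata: initialize sessions for this request with custom parameters.
params_meta = {"zyte_api_session_params": {"browserHtml": True}}


def session_init_params(params: dict, request_url: str) -> dict:
    """Return initialization parameters, using the URL of the triggering
    request only when ``params`` does not define its own "url"."""
    return {"url": request_url, **params}


without_url = session_init_params(
    params_meta["zyte_api_session_params"], "https://example.com/product/1"
)
with_url = session_init_params(
    {"url": "https://example.com/", "browserHtml": True},
    "https://example.com/product/1",
)
```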
Checking sessions
Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.
Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches, A-B tests, etc. It can also help in cases where website sessions expire before Zyte API sessions.
By default, if the session initialization parameters
include actions, and any of them has a returned
status in the response (meaning it failed and stopped execution), the session
is discarded. Actions with onError set to "continue" that fail produce
a continued status instead, and do not cause the session to be discarded.
You can disable this behavior by setting
ZYTE_API_SESSION_INIT_ACTION_FAILURE_INVALIDATES_SESSION to
False.
In addition, if a location is defined through
zyte_api_session_location, ZYTE_API_SESSION_LOCATION or
location(), and the setLocation action
is not available for a given website, the spider is closed with
unsupported_set_location as the close reason; in that case, you should
define a proper session initialization logic for requests
targeting that website.
For sessions initialized without actions, no action-based session check is performed.
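The default action-based check can be approximated as follows. This is a simplified sketch rather than the plugin's implementation: it assumes that the raw Zyte API response lists action results under an `actions` key, each with a `status` field, and the function name is made up.

```python
def init_actions_failed(raw_api_response: dict) -> bool:
    """Return True if any initialization action stopped with a "returned" status."""
    return any(
        action.get("status") == "returned"
        for action in raw_api_response.get("actions", [])
    )


failed = init_actions_failed(
    {"actions": [{"action": "setLocation", "status": "returned"}]}
)
continued = init_actions_failed(
    {"actions": [{"action": "click", "status": "continued"}]}
)
no_actions = init_actions_failed({"browserHtml": "<html></html>"})
```

Only the first response would cause the session to be discarded under the default behavior.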
To implement your own code to check session responses and determine whether
their session should be kept or discarded, use the
ZYTE_API_SESSION_CHECKER setting. If you need to check session
validity for multiple websites, it is better to define a separate session
config override for each website, each with its own
implementation of check().
The zyte_api_session_location and zyte_api_session_params
request metadata keys, if present in a request that triggers a session
initialization request, will be copied into the session
initialization request, so that they are available when
ZYTE_API_SESSION_CHECKER or
check() are called for a session
initialization request.
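As an illustration, a checker can read the copied location metadata to verify that the response reflects the requested location. This is a sketch under assumptions: the class name is hypothetical, and a real website may need a more specific test than looking for the postal code in the response text.

```python
class PostalCodeChecker:
    """Keep a session only if the requested postal code appears in the response."""

    def check(self, response, request) -> bool:
        location = request.meta.get("zyte_api_session_location", {})
        postal_code = location.get("postalCode")
        if not postal_code:
            # No location was requested, so there is nothing to verify.
            return True
        return postal_code in response.text


# settings.py
ZYTE_API_SESSION_CHECKER = PostalCodeChecker
```

Returning False discards the session and retries the request with a different session.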
If your session checking implementation relies on the response body (e.g. it
uses CSS or XPath expressions), you should make sure that you are getting one,
which might not be the case if you are mostly using Zyte API automatic
extraction, e.g. when using Zyte spider templates. For example, you can use
ZYTE_API_AUTOMAP_PARAMS and ZYTE_API_PROVIDER_PARAMS to
force browserHtml or httpResponseBody to be set
on every Zyte API request:
ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}
Managing pools
scrapy-zyte-api can maintain multiple session pools.
By default, scrapy-zyte-api maintains a separate pool of sessions per domain.
If you use the zyte_api_session_params or
zyte_api_session_location request metadata keys, scrapy-zyte-api
will automatically use separate session pools within the target domain for
those parameters or locations. See pool()
for details.
If you want to customize further which pool is assigned to a given request,
e.g. to have the same pool for multiple domains or use different pools within
the same domain (e.g. for different URL patterns), you can either use the
zyte_api_session_pool request metadata key or use the
pool() method of session config
overrides.
The ZYTE_API_SESSION_POOL_SIZE setting determines the desired number
of concurrent, active, working sessions per pool. The
ZYTE_API_SESSION_POOLS setting allows defining different values
for specific pools.
The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:
When a request asks for a session from a given pool, if the session pool has not yet reached its desired pool size, a session initialization request is triggered. If the session pool has been filled, an existing session is used instead.
When a response associated with a session pool indicates that the session has expired, or pushes the session over the error limit (see ZYTE_API_SESSION_MAX_ERRORS) or over the validity check failure limit (see ZYTE_API_SESSION_MAX_CHECK_FAILURES), a session initialization request is triggered to replace that session in the session pool.
The session pool assigned to a request affects the fingerprint of the request. 2 requests with a different session pool ID are considered different requests, i.e. not duplicate requests, even if they are otherwise identical.
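The pool filling and replacement behavior described above can be modeled with a toy class. This is not the plugin's implementation; the class, its method names and the session ID format are made up, and real session initialization involves Zyte API requests rather than counters.

```python
from itertools import count


class ToySessionPool:
    """Toy model of a single session pool with a desired size."""

    def __init__(self, size: int):
        self.size = size
        self.sessions: list[str] = []
        self._ids = count(1)
        self._next = 0  # index of the next session to reuse

    def _init_session(self) -> str:
        # Stands in for a session initialization request.
        session_id = f"session-{next(self._ids)}"
        self.sessions.append(session_id)
        return session_id

    def get_session(self) -> tuple[str, bool]:
        """Return (session_id, initialized_now) for a request."""
        if len(self.sessions) < self.size:
            return self._init_session(), True
        session_id = self.sessions[self._next % len(self.sessions)]
        self._next += 1
        return session_id, False

    def discard(self, session_id: str) -> str:
        """Drop an expired or failed session and initialize a replacement."""
        self.sessions.remove(session_id)
        return self._init_session()


pool = ToySessionPool(size=2)
first = pool.get_session()   # pool not full: initializes a session
second = pool.get_session()  # pool not full: initializes another session
third = pool.get_session()   # pool full: reuses an existing session
replacement = pool.discard("session-1")
```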
Optimizing sessions
For faster crawls and lower costs, especially where session initialization
requests are more expensive than session usage requests (e.g. scenarios where
initialization relies on browserHtml while usage relies on
httpResponseBody), you should try to make your sessions live as long as
possible before they are discarded.
Here are some things you can try:
On some websites, sending too many requests too fast through a session can cause the target website to ban that session.
On those websites, you can increase ZYTE_API_SESSION_DELAY, ZYTE_API_SESSION_POOL_SIZE, or both, to lower the rate of session reuse.
Mind, however, that user-managed sessions expire after 15 minutes since creation or 2 minutes since the last request (see session). At a certain point, increasing ZYTE_API_SESSION_POOL_SIZE without increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN accordingly can be counterproductive.
By default, sessions are discarded as soon as an unsuccessful response is received or a validity check is failed.
However, on some websites sessions may remain valid even after a few unsuccessful responses or validity check failures. If that is the case, you might want to increase the corresponding setting, ZYTE_API_SESSION_MAX_ERRORS or ZYTE_API_SESSION_MAX_CHECK_FAILURES, to require a higher number of the corresponding outcome before discarding a session.
If you do not need session checking and your initialization parameters are only browserHtml and actions, Zyte-managed sessions might be a more cost-effective choice, as they live much longer than user-managed sessions.
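A settings sketch that combines the knobs mentioned above might look like this. All values are illustrative, not recommendations, and the delay is assumed to be expressed in seconds; tune them against your target website.

```python
# settings.py (illustrative values only)

# Keep concurrency in line with the pool size so sessions are not left idle
# long enough to expire.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
ZYTE_API_SESSION_POOL_SIZE = 16

# Wait longer before reusing the same session (assumed unit: seconds).
ZYTE_API_SESSION_DELAY = 2.0

# Tolerate a few bad outcomes before discarding a session.
ZYTE_API_SESSION_MAX_ERRORS = 3
ZYTE_API_SESSION_MAX_CHECK_FAILURES = 2
```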
Overriding session configs
For spiders that target a single website, using settings and request metadata keys for session initialization and session checking should do the job. However, for broad-crawl spiders, multi-website spiders, to modify session-using requests based on session initialization responses, or for code reusability purposes, you might want to define different session configs for different websites.
The default session config is implemented by the
SessionConfig class:
- class scrapy_zyte_api.SessionConfig(crawler)
Default session configuration for scrapy-zyte-api sessions.
- ADDRESS_FIELDS: list[str] = ['addressCountry', 'addressRegion', 'postalCode', 'streetAddress']
List of address fields to use when available, and their order, when creating a pool ID for a request based on the content of the zyte_api_session_location metadata key. See pool().
- check(response: Response, request: Request) -> bool
Return True if the session used to fetch response should be kept, return False if it should be discarded, or raise CloseSpider if the spider should be closed.
The default implementation checks the outcome of the setLocation action if a location was defined, and also discards sessions where any action in the initialization response has a returned status (i.e. failed and stopped execution). Both behaviors are described in Checking sessions.
If you need to tell whether request is a session initialization request or not, use is_session_init_request().
- enabled(request: Request) -> bool
Return True if the request should use sessions from session management or False otherwise.
The default implementation is based on settings and request metadata keys as described in Enabling session management.
- location(request: Request) -> dict[str, str]
Return the address dict to use for location-based session initialization for request.
The default implementation is based on settings and request metadata keys as described in Initializing sessions.
When overriding this method, you should only return a custom value if the default implementation returns an empty dict, e.g.
    def location(self, request: Request) -> Dict[str, str]:
        fallback = {"addressCountry": "US", "addressRegion": "NY", "postalCode": "10001"}
        return super().location(request) or fallback
Note
An implementation of location() can technically override zyte_api_session_location or ZYTE_API_SESSION_LOCATION, but it is not recommended as it breaks the precedence chain that users expect.
You should only override this method if you need a location to be used even when no location is specified through request metadata or settings. It can be especially useful if you can determine the right location to use based on the request, e.g.
    def location(self, request: Request) -> Dict[str, str]:
        fallback = {}
        if postal_code := w3lib.url.url_query_parameter(request.url, "postalCode"):
            fallback["postalCode"] = postal_code
        return super().location(request) or fallback
Same as with zyte_api_session_location and ZYTE_API_SESSION_LOCATION, the returned location fields should match those of the address parameter of the setLocation action where possible, even when using an implementation of params() that does not rely on the setLocation action.
- params(request: Request) -> dict[str, Any] | Awaitable[dict[str, Any]]
Return the Zyte API request parameters to use to initialize a session for request.
The default implementation is based on settings and request metadata keys as described in Initializing sessions.
When overriding this method, you should return parameters for the target location, i.e. the output of location(), unless that output is an empty dict, e.g.
    def params(self, request: Request) -> Dict[str, Any]:
        if location := self.location(request):
            return {
                "url": "https://example.com/new-session/for-country",
                "httpResponseBody": True,
                "httpRequestMethod": "POST",
                "httpRequestText": location["addressCountry"],
            }
        return {
            "url": "https://example.com/new-session",
            "httpResponseBody": True,
        }
The returned parameters do not need to include url. If missing, it is picked from the request triggering a session initialization request.
This method can be implemented as a coroutine function. For example:
    async def params(self, request: Request) -> Dict[str, Any]:
        bootstrap_request = Request(
            "https://example.com/api/get-session",
            meta={
                "zyte_api_session_enabled": False,
                "zyte_api": {
                    "httpResponseBody": True,
                    "responseCookies": True,
                },
            },
        )
        response = await self.crawler.engine.download_async(bootstrap_request)
        return {
            "url": "https://example.com/new-session",
            "httpResponseBody": True,
            "requestCookies": response.raw_api_response["responseCookies"],
        }
- pool(request: Request) -> str | PoolConfig
Return the ID of the session pool to use for request, or a dict with additional session pool config.
The main aspects of the default implementation are described in Managing pools.
When the zyte_api_session_params request metadata key is used, the pool ID is the target domain followed by an integer between brackets (e.g. example.com[0]), and a log message indicates which session initialization parameters are associated with that pool ID.
When the zyte_api_session_location request metadata key is used, the pool ID is the target domain followed by an at sign and the comma-separated values of the non-empty fields from ADDRESS_FIELDS (e.g. example.com@US,NY,10001).
Instead of a string, this method can also return a dict containing the pool ID under the id key, and optionally any other key supported by ZYTE_API_SESSION_POOLS. For example:
    def pool(self, request):
        if "ecommerce.example" in urlparse_cached(request).netloc:
            return {
                "id": "ecommerce.example",
                "delay": 2.0,
                "size": 16,
            }
        return super().pool(request)
The values of optional keys take precedence over the corresponding pool-independent settings, e.g. delay takes precedence over ZYTE_API_SESSION_DELAY for the corresponding pool ID, but do not override those defined in ZYTE_API_SESSION_POOLS.
For any given pool ID, the values of optional keys are only taken into account when the pool ID is first encountered. You cannot use this method to change them at run time.
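The location-based pool ID format described above can be reproduced with a short function. This is an illustrative re-implementation under the stated format, not the library's code, and the function name is hypothetical.

```python
ADDRESS_FIELDS = ["addressCountry", "addressRegion", "postalCode", "streetAddress"]


def location_pool_id(domain: str, location: dict) -> str:
    """Join the non-empty address fields, in ADDRESS_FIELDS order, after an at sign."""
    values = [location[field] for field in ADDRESS_FIELDS if location.get(field)]
    return f"{domain}@{','.join(values)}"


pool_id = location_pool_id(
    "example.com",
    {"postalCode": "10001", "addressCountry": "US", "addressRegion": "NY"},
)
```

Because the field order comes from ADDRESS_FIELDS rather than from the dictionary, two requests that specify the same location with a different key order map to the same pool ID.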
- process_request(request: Request) -> Request | None
Process request after it has been assigned a session.
Return None to send the request as is, or return a new request object to replace the original request.
The default implementation does not modify the request.
You can combine this method and check() to modify requests based on session initialization responses. For example:
In __init__(), create a dictionary to store session data:
    def __init__(self, crawler):
        super().__init__(crawler)
        self.session_data = {}
In check(), store data from the session initialization response in session_data:
    def check(self, response: Response, request: Request) -> bool:
        if scrapy_zyte_api.is_session_init_request(request):
            session_id = scrapy_zyte_api.get_request_session_id(request)
            self.session_data[session_id] = {
                "csrf_token": response.css(".csrf-token::text").get(),
            }
        return super().check(response, request)
In process_request(), read the session data and act accordingly, either modifying the request in place where possible, e.g.:
    def process_request(self, request: Request) -> Optional[Request]:
        session_id = scrapy_zyte_api.get_request_session_id(request)
        csrf_token = self.session_data[session_id]["csrf_token"]
        request.headers["CSRF-Token"] = csrf_token
Or returning an entirely new request, e.g.:
    def process_request(self, request: Request) -> Optional[Request]:
        session_id = scrapy_zyte_api.get_request_session_id(request)
        csrf_token = self.session_data[session_id]["csrf_token"]
        new_url = w3lib.url.add_or_replace_parameter(request.url, "csrf_token", csrf_token)
        return request.replace(url=new_url)
To define a different session config for a given URL pattern, install
web-poet and define a subclass of
SessionConfig decorated with
session_config():
- scrapy_zyte_api.session_config(include: str | Iterable[str], *, instead_of: type[SessionConfig] | None = SessionConfig, exclude: str | Iterable[str] | None = None, priority: int = 500, **kwargs)
If you only need to override the SessionConfig.check or SessionConfig.params methods for scenarios involving a
location, you may subclass LocationSessionConfig
instead:
- class scrapy_zyte_api.LocationSessionConfig(crawler)
SessionConfig subclass to minimize boilerplate when implementing location-specific session configs, i.e. session configs where the default values should be used unless a location is set.
Provides counterparts to some SessionConfig methods that are only called when a location is set, and get that location as a parameter.
If in a session config implementation or in any other Scrapy component you need
to tell whether a request is a session initialization request or not, use is_session_init_request():
- scrapy_zyte_api.is_session_init_request(request)
Return True if the request is a session initialization request or False otherwise.
To get the session ID of a given request, use:
- scrapy_zyte_api.get_request_session_id(request: Request) -> str | None
Return the session ID of request, or None if it does not have a session ID assigned.
Classes decorated with session_config() are registered
into session_config_registry:
- scrapy_zyte_api.session_config_registry
Instance of web_poet.rules.RulesRegistry that holds session configs.
Session retry policies
The following retry policies are designed to work well with session management
(see Enabling session management). They are meant for
ZYTE_API_SESSION_RETRY_POLICY:
- scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY
Alternative to the default retry policy for session management that does not retry 520 and 521 responses.
- scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY
Alternative to the aggressive retry policy for session management that does not retry 520 and 521 responses.
Note
When using python-zyte-api 0.5.2 or lower, this is the same as
SESSION_DEFAULT_RETRY_POLICY.
Spider closers
Session management can close your spider early in the following scenarios:
bad_session_inits: Too many session initializations failed in a row for a given session pool.
You can use the ZYTE_API_SESSION_MAX_BAD_INITS and ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL settings to adjust that maximum.
pool_error: There was an error determining the session pool ID for some request.
It is most likely the result of a bad implementation of SessionConfig.pool; the logs should contain an error message with a traceback for such errors.
unsupported_set_location: You used ZYTE_API_SESSION_LOCATION or zyte_api_session_location to configure session initialization with the setLocation action, but Zyte API does not yet support setLocation for the target website.
A custom SessionConfig.check
implementation may also close your spider with a custom reason by raising a
CloseSpider exception.
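If bad_session_inits closes your spider too eagerly, the limits can be raised in your settings. The sketch below uses illustrative values and assumes that the per-pool setting maps pool IDs to limits; check the settings reference for the exact format before relying on it.

```python
# settings.py (illustrative values only)

# Maximum number of consecutive failed session initializations for any pool.
ZYTE_API_SESSION_MAX_BAD_INITS = 16

# Assumed format: pool ID -> limit. Here, a stricter limit for one pool.
ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL = {
    "example.com": 4,
}
```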
Troubleshooting
RuntimeError: Could not get a session ID
If you see this exception, indicating that after a given number of attempts, with a given minimum wait time between attempts, it was not possible to get a session ID from the session rotation queue, consider the following possibilities:
A bug in your session validation code may be causing it to return False for a valid response.
This is especially likely if you see this issue for very few, specific requests, while most requests work fine.
The values of the ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS and ZYTE_API_SESSION_QUEUE_WAIT_TIME settings may be too low for your scenario, in which case you can modify them accordingly.
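As a rough sketch, raising both queue settings lengthens the time a request waits for a session before the exception is raised. The values are illustrative, the wait time is assumed to be in seconds, and the product below is only a lower-bound estimate of the total wait.

```python
# settings.py (illustrative values only)
ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS = 120
ZYTE_API_SESSION_QUEUE_WAIT_TIME = 2.0  # assumed unit: seconds

# Approximate minimum time spent waiting before giving up on a session ID.
min_total_wait = ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS * ZYTE_API_SESSION_QUEUE_WAIT_TIME
```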