Session management
Zyte API provides powerful session APIs:
Client-managed sessions give you full control over session management.
Server-managed sessions let Zyte API handle session management for you.
When using scrapy-zyte-api, you can use these session APIs through the corresponding Zyte API fields (session, sessionContext).
However, scrapy-zyte-api also provides its own session management API, similar to that of server-managed sessions, but built on top of client-managed sessions.
scrapy-zyte-api session management offers some advantages over server-managed sessions:
You can perform session validity checks, so that the sessions of responses that do not pass those checks are refreshed, and the responses retried with a different session.
You can use arbitrary Zyte API parameters for session initialization, beyond those that sessionContextParameters supports.
You have granular control over the session pool size, max errors, etc. See Optimizing sessions and Overriding session configs.
However, scrapy-zyte-api session management is not a replacement for server-managed sessions or client-managed sessions:
Server-managed sessions offer a longer lifetime than the client-managed sessions that scrapy-zyte-api session management uses, so as long as you do not need one of the scrapy-zyte-api session management features, server-managed sessions can be significantly more efficient (fewer total sessions needed per crawl).
Zyte API can also optimize server-managed sessions based on the target website. With scrapy-zyte-api session management, you need to handle optimization yourself.
Client-managed sessions offer full control over session management, while scrapy-zyte-api session management removes some of that control to provide an easier API for supported use cases.
Enabling session management
To enable session management for all requests, set ZYTE_API_SESSION_ENABLED to True. You can also toggle session management on or off for specific requests using the zyte_api_session_enabled request metadata key, or override the enabled() method of a session config override.
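The two switches above can be sketched as follows. The setting and the metadata key come from this page; the constant name NO_SESSION_META is only illustrative, and in a spider the dictionary would be passed as the meta argument of a Scrapy request.

```python
# settings.py: turn on session management for every request.
ZYTE_API_SESSION_ENABLED = True

# Request metadata that opts a single request out of session management.
# In a spider you would pass it as scrapy.Request(url, meta=NO_SESSION_META);
# the constant name is hypothetical.
NO_SESSION_META = {"zyte_api_session_enabled": False}
```

The reverse also works: leave the setting disabled and set the metadata key to True only on the requests that need sessions.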
By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each initialized with a browser request targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. You can customize most of this logic through request metadata, settings and session config overrides.
For session management to work as expected, your ZYTE_API_RETRY_POLICY should not retry 520 and 521 responses:
If you are using the default retry policy (zyte_api_retrying) or aggressive_retrying:
If you are using the scrapy-zyte-api add-on, these built-in retry policies are automatically replaced with a matching session-specific retry policy, either SESSION_DEFAULT_RETRY_POLICY or SESSION_AGGRESSIVE_RETRY_POLICY.
If you are not using the scrapy-zyte-api add-on, set ZYTE_API_RETRY_POLICY manually to either SESSION_DEFAULT_RETRY_POLICY or SESSION_AGGRESSIVE_RETRY_POLICY. For example:
ZYTE_API_RETRY_POLICY = "scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY"
If you are using a custom retry policy, modify it to not retry 520 and 521 responses.
Initializing sessions
To change the default session initialization parameters, you have the following options:
To initialize sessions with a given location, use the ZYTE_API_SESSION_LOCATION setting or the zyte_api_session_location request metadata key.
The value should be a dictionary with keys supported by the address field of the setLocation action, e.g.
{
    "addressCountry": "US",
    "addressRegion": "NY",
    "postalCode": "10001",
    "streetAddress": "3 Penn Plz",
}
By default, the location is set using the setLocation action. A session config override can change that through params().
For session initialization with arbitrary Zyte API request fields, use the ZYTE_API_SESSION_PARAMS setting or the zyte_api_session_params request metadata key.
It works similarly to sessionContextParameters from server-managed sessions, but it supports arbitrary Zyte API parameters instead of a specific subset.
If it does not define a "url", the URL of the request triggering a session initialization request will be used.
When defining a session config override, you can customize the default and location-setting session initialization parameters through params().
location() can define a default location for its session config override to use when no location is specified otherwise.
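As a minimal sketch of the first two options, the snippet below sets a project-wide location and builds request metadata that initializes sessions with arbitrary parameters instead. The CSS selector and the constant name COOKIE_BANNER_META are hypothetical; the action format mirrors the click action used in the session_config() usage example on this page.

```python
# settings.py: location used to initialize sessions unless a request says otherwise.
ZYTE_API_SESSION_LOCATION = {"addressCountry": "US", "postalCode": "10001"}

# Request metadata for requests whose sessions need custom initialization.
# No "url" is given, so the URL of the request that triggers the session
# initialization is used. The selector is a made-up example.
COOKIE_BANNER_META = {
    "zyte_api_session_params": {
        "browserHtml": True,
        "actions": [
            {
                "action": "click",
                "selector": {"type": "css", "value": "#accept-cookies"},
            },
        ],
    },
}
```

In a spider, the dictionary would be passed as the meta argument of the requests that should use sessions initialized this way.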
Precedence, from higher to lower, is:
Checking sessions
Responses from a session can be checked for session validity. If a response does not pass a session validity check, the session is discarded, and the request is retried with a different session.
Session checking can be useful to work around scenarios where session initialization fails, e.g. due to rendering issues, IP-geolocation mismatches, A-B tests, etc. It can also help in cases where website sessions expire before Zyte API sessions.
By default, if a location is defined through zyte_api_session_location, ZYTE_API_SESSION_LOCATION or location(), the outcome of the first setLocation action used, if any, is checked, even if the parameters used for session initialization actually come from zyte_api_session_params or ZYTE_API_SESSION_PARAMS. If the action fails, the session is discarded. If the action is not even available for a given website, the spider is closed with unsupported_set_location as the close reason; in that case, you should define proper session initialization logic for requests targeting that website.
For sessions initialized without a configured location, no session check is performed; sessions are assumed to be fine until they expire or are banned. That is so even if session initialization parameters include a setLocation action.
To implement your own code to check session responses and determine whether their session should be kept or discarded, use the ZYTE_API_SESSION_CHECKER setting. If you need to check session validity for multiple websites, it is better to define a separate session config override for each website, each with its own implementation of check().
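A custom checker could look roughly like the sketch below. It assumes that ZYTE_API_SESSION_CHECKER accepts the import path of a class exposing a check(response, request) method with the same contract as SessionConfig.check (return True to keep the session, False to discard it); the class name, module path and marker text are hypothetical.

```python
class MarkerSessionChecker:
    """Keep a session only while a marker string shows up in the page.

    Hypothetical example: the marker would be whatever text proves that
    the session still carries the state set during initialization.
    """

    MARKER = "Deliver to 10001"

    def check(self, response, request) -> bool:
        # response.text is the decoded body of a Scrapy text response.
        return self.MARKER in response.text


# settings.py (the import path is an assumption about your project layout).
ZYTE_API_SESSION_CHECKER = "myproject.sessions.MarkerSessionChecker"
```

Because the checker reads the response body, it only works if every request using the session actually returns one.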
The zyte_api_session_location and zyte_api_session_params request metadata keys, if present in a request that triggers a session initialization request, will be copied into the session initialization request, so that they are available when ZYTE_API_SESSION_CHECKER or check() are called for a session initialization request.
If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using Zyte API automatic extraction, e.g. when using Zyte spider templates. For example, you can use ZYTE_API_AUTOMAP_PARAMS and ZYTE_API_PROVIDER_PARAMS to force browserHtml or httpResponseBody to be set on every Zyte API request:
ZYTE_API_AUTOMAP_PARAMS = {"browserHtml": True}
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}
Managing pools
scrapy-zyte-api can maintain multiple session pools.
By default, scrapy-zyte-api maintains a separate pool of sessions per domain.
If you use the zyte_api_session_params or zyte_api_session_location request metadata keys, scrapy-zyte-api will automatically use separate session pools within the target domain for those parameters or locations. See pool() for details.
If you want to customize further which pool is assigned to a given request, e.g. to have the same pool for multiple domains or use different pools within the same domain (e.g. for different URL patterns), you can either use the zyte_api_session_pool request metadata key or use the pool() method of session config overrides.
The ZYTE_API_SESSION_POOL_SIZE setting determines the desired number of concurrent, active, working sessions per pool. The ZYTE_API_SESSION_POOL_SIZES setting allows defining different values for specific pools.
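For illustration, the pool-related knobs above might be combined as in this sketch. It assumes that ZYTE_API_SESSION_POOL_SIZES maps pool IDs to sizes; the pool ID "shop.example" and the constant name SHARED_POOL_META are hypothetical.

```python
# settings.py: desired number of sessions per pool (8 is the default
# mentioned earlier on this page).
ZYTE_API_SESSION_POOL_SIZE = 8

# Assumed format: pool ID -> desired size for that specific pool.
ZYTE_API_SESSION_POOL_SIZES = {"shop.example": 16}

# Request metadata that sends a request to a named pool, e.g. to share one
# pool across several domains of the same shop.
SHARED_POOL_META = {"zyte_api_session_pool": "shop.example"}
```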
The actual number of sessions created for a session pool depends on the number of requests that ask for a session from that pool, and the lifetime of those sessions:
When a request asks for a session from a given pool, if the session pool has not yet reached its desired pool size, a session initialization request is triggered. If the session pool has been filled, an existing session is used instead.
When a response associated with a session pool indicates that the session expired, pushes the session over its error limit (see ZYTE_API_SESSION_MAX_ERRORS), or fails a validity check, a session initialization request is triggered to replace that session in the session pool.
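The two rules above can be modelled with a toy pool. This is only a simplified illustration of the described behavior, not the scrapy-zyte-api implementation: the class, its round-robin rotation and its integer session IDs are all made up for the example.

```python
from itertools import count


class ToySessionPool:
    """Toy model of the pool life cycle described above."""

    def __init__(self, size):
        self.size = size        # desired pool size
        self.sessions = []      # IDs of live sessions
        self.inits = 0          # session initialization requests so far
        self._ids = count(1)    # source of fake session IDs
        self._cursor = 0        # round-robin position

    def _init_session(self):
        # Stands in for a session initialization request.
        self.inits += 1
        session = next(self._ids)
        self.sessions.append(session)
        return session

    def acquire(self):
        """Return a session for a request; initialize one if the pool is not full."""
        if len(self.sessions) < self.size:
            return self._init_session()
        session = self.sessions[self._cursor % len(self.sessions)]
        self._cursor += 1
        return session

    def discard(self, session):
        """Drop an expired, errored or invalid session and initialize a replacement."""
        self.sessions.remove(session)
        return self._init_session()
```

With a pool of size 2, the first two calls to acquire() each trigger an initialization, the third reuses an existing session, and every discard() adds one more initialization, which is why short-lived sessions inflate the total number of sessions per crawl.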
Optimizing sessions
For faster crawls and lower costs, especially where session initialization requests are more expensive than session usage requests (e.g. scenarios where initialization relies on browserHtml while usage relies on httpResponseBody), you should try to make your sessions live as long as possible before they are discarded.
Here are some things you can try:
On some websites, sending too many requests too fast through a session can cause the target website to ban that session.
On those websites, you can increase the number of sessions in the pool (ZYTE_API_SESSION_POOL_SIZE). The more different sessions you use, the more slowly you send requests through each session.
Mind, however, that client-managed sessions expire after 15 minutes since creation or 2 minutes since the last request. At a certain point, increasing ZYTE_API_SESSION_POOL_SIZE without increasing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN accordingly can be counterproductive.
By default, sessions are discarded as soon as an unsuccessful response is received.
However, on some websites sessions may remain valid even after a few unsuccessful responses. If that is the case, you might want to increase ZYTE_API_SESSION_MAX_ERRORS to require a higher number of unsuccessful responses before discarding a session.
If you do not need session checking and your initialization parameters are only browserHtml and actions, server-managed sessions might be a more cost-effective choice, as they live much longer than client-managed sessions.
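To make the trade-off between pool size and session expiration concrete, here is a rough sketch. The setting values are arbitrary examples, and the helper seconds_between_uses is hypothetical: it assumes requests are spread evenly across the sessions of a pool, which real crawls only approximate.

```python
# settings.py: example values only, tune them per website.
ZYTE_API_SESSION_POOL_SIZE = 16
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
ZYTE_API_SESSION_MAX_ERRORS = 3  # tolerate a few unsuccessful responses

# Client-managed sessions expire 2 minutes after their last request.
IDLE_LIMIT_SECONDS = 120


def seconds_between_uses(pool_size, requests_per_second):
    """Average idle time of a session, assuming even rotation across the pool."""
    return pool_size / requests_per_second


# At 2 requests per second, each of 16 sessions is used every 8 seconds on
# average, far below the idle limit; at 0.1 requests per second the gap
# grows to 160 seconds and sessions would expire between uses.
```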
Overriding session configs
For spiders that target a single website, using settings and request metadata keys for session initialization and session checking should do the job. However, for broad-crawl spiders, multi-website spiders, or for code reusability purposes, you might want to define different session configs for different websites.
The default session config is implemented by the SessionConfig class:
- class scrapy_zyte_api.SessionConfig(crawler)
Default session configuration for scrapy-zyte-api sessions.
- ADDRESS_FIELDS: List[str] = ['addressCountry', 'addressRegion', 'postalCode', 'streetAddress']
List of address fields to use when available, and their order, when creating a pool ID for a request based on the content of the zyte_api_session_location metadata key. See pool().
- check(response: Response, request: Request) -> bool
Return True if the session used to fetch response should be kept, return False if it should be discarded, or raise CloseSpider if the spider should be closed.
The default implementation checks the outcome of the setLocation action if a location was defined, as described in Checking sessions.
If you need to tell whether request is a session initialization request or not, use is_session_init_request().
- enabled(request: Request) -> bool
Return True if the request should use sessions from session management or False otherwise.
The default implementation is based on settings and request metadata keys as described in Enabling session management.
- location(request: Request) -> Dict[str, str]
Return the address dict to use for location-based session initialization for request.
The default implementation is based on settings and request metadata keys as described in Initializing sessions.
When overriding this method, you should only return a custom value if the default implementation returns an empty dict, e.g.
def location(self, request: Request) -> Dict[str, str]:
    fallback = {"addressCountry": "US", "addressRegion": "NY", "postalCode": "10001"}
    return super().location(request) or fallback
Note
An implementation of location() can technically override zyte_api_session_location or ZYTE_API_SESSION_LOCATION, but it is not recommended as it breaks the precedence chain that users expect.
You should only override this method if you need a location to be used even when no location is specified through request metadata or settings. It can be especially useful if you can determine the right location to use based on the request, e.g.
def location(self, request: Request) -> Dict[str, str]:
    fallback = {}
    if postal_code := w3lib.url.url_query_parameter(request.url, "postalCode"):
        fallback["postalCode"] = postal_code
    return super().location(request) or fallback
Same as with zyte_api_session_location and ZYTE_API_SESSION_LOCATION, the returned location fields should match those of the address parameter of the setLocation action where possible, even when using an implementation of params() that does not rely on the setLocation action.
- params(request: Request) -> Dict[str, Any]
Return the Zyte API request parameters to use to initialize a session for request.
The default implementation is based on settings and request metadata keys as described in Initializing sessions.
When overriding this method, you should return parameters for the target location, i.e. the output of location(), unless that output is an empty dict, e.g.
def params(self, request: Request) -> Dict[str, Any]:
    if location := self.location(request):
        return {
            "url": "https://example.com/new-session/for-country",
            "httpResponseBody": True,
            "httpRequestMethod": "POST",
            "httpRequestText": location["addressCountry"],
        }
    return {
        "url": "https://example.com/new-session",
        "httpResponseBody": True,
    }
The returned parameters do not need to include url. If missing, it is picked from the request triggering a session initialization request.
- pool(request: Request) -> str
Return the ID of the session pool to use for request.
The main aspects of the default implementation are described in Managing pools.
When the zyte_api_session_params request metadata key is used, the pool ID is the target domain followed by an integer between brackets (e.g. example.com[0]), and a log message indicates which session initialization parameters are associated with that pool ID.
When the zyte_api_session_location request metadata key is used, the pool ID is the target domain followed by an at sign and the comma-separated values of the non-empty fields from ADDRESS_FIELDS (e.g. example.com@US,NY,10001).
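The location-based pool ID format described above can be reproduced with a few lines of Python. This is a re-implementation sketch for illustration, not the library code; the function name location_pool_id is hypothetical, while the field list copies ADDRESS_FIELDS as documented above.

```python
ADDRESS_FIELDS = ["addressCountry", "addressRegion", "postalCode", "streetAddress"]


def location_pool_id(domain, location):
    """Build an ID like example.com@US,NY,10001 from a location dict."""
    # Keep only non-empty fields, in the order set by ADDRESS_FIELDS
    # rather than in the order of the dictionary keys.
    values = [location[field] for field in ADDRESS_FIELDS if location.get(field)]
    return f"{domain}@{','.join(values)}"


pool_id = location_pool_id(
    "example.com",
    {"postalCode": "10001", "addressRegion": "NY", "addressCountry": "US"},
)
```

Because the order comes from ADDRESS_FIELDS, two requests that define the same location with keys in a different order end up in the same pool.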
To define a different session config for a given URL pattern, install web-poet and define a subclass of SessionConfig decorated with session_config():
- scrapy_zyte_api.session_config(include, *, instead_of: Type | None = SessionConfig, exclude=None, priority: int = 500, **kwargs)
Mark the decorated SessionConfig subclass as the session config to use for the specified URL patterns.
Usage example:
from typing import Any, Dict

from scrapy import Request
from scrapy.http.response import Response
from scrapy_zyte_api import SessionConfig, session_config


# One URL pattern per list item.
@session_config(["ecommerce.de.example", "ecommerce.us.example"])
class EcommerceExampleSessionConfig(SessionConfig):

    def pool(self, request: Request) -> str:
        # Share a single pool between both domains.
        return "ecommerce.example"

    def params(self, request: Request) -> Dict[str, Any]:
        if location := self.location(request):
            # Set the location by typing the postal code into the page.
            return {
                "url": request.url,
                "browserHtml": True,
                "actions": [
                    {
                        "action": "type",
                        "selector": {"type": "css", "value": ".zipcode"},
                        "text": location["postalCode"],
                    },
                    {
                        "action": "click",
                        "selector": {"type": "css", "value": "[type='submit']"},
                    },
                ],
            }
        return super().params(request)

    def check(self, response: Response, request: Request) -> bool:
        if location := self.location(request):
            # Keep the session only if the page shows the expected postal code.
            return response.css(".zipcode::text").get() == location["postalCode"]
        return super().check(response, request)
Your SessionConfig subclass must be defined in a module that gets imported at run time. See SCRAPY_POET_DISCOVER in the scrapy-poet setting reference.
The parameters of this decorator are those of web_poet.handle_urls(), only instead_of is SessionConfig by default, to_return is not supported, and session configs are registered in their own rule registry.
If you only need to override the SessionConfig.check or SessionConfig.params methods for scenarios involving a location, you may subclass LocationSessionConfig instead:
- class scrapy_zyte_api.LocationSessionConfig(crawler)
SessionConfig subclass to minimize boilerplate when implementing location-specific session configs, i.e. session configs where the default values should be used unless a location is set.
Provides counterparts to some SessionConfig methods that are only called when a location is set, and get that location as a parameter.
If in a session config implementation or in any other Scrapy component you need to tell whether a request is a session initialization request or not, use is_session_init_request():
- scrapy_zyte_api.is_session_init_request(request)
Return True if the request is a session initialization request or False otherwise.
Classes decorated with session_config() are registered into session_config_registry:
- scrapy_zyte_api.session_config_registry
Instance of web_poet.rules.RulesRegistry that holds session configs.
Session retry policies
The following retry policies are designed to work well with session management (see Enabling session management):
- scrapy_zyte_api.SESSION_DEFAULT_RETRY_POLICY
Alternative to the default retry policy for session management that does not retry 520 and 521 responses.
- scrapy_zyte_api.SESSION_AGGRESSIVE_RETRY_POLICY
Alternative to the aggressive retry policy for session management that does not retry 520 and 521 responses.
Note
When using python-zyte-api 0.5.2 or lower, this is the same as SESSION_DEFAULT_RETRY_POLICY.
Spider closers
Session management can close your spider early in the following scenarios:
bad_session_inits: Too many session initializations failed in a row for a given session pool.
You can use the ZYTE_API_SESSION_MAX_BAD_INITS and ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL settings to adjust that maximum.
pool_error: There was an error determining the session pool ID for some request.
It is most likely the result of a bad implementation of SessionConfig.pool; the logs should contain an error message with a traceback for such errors.
unsupported_set_location: You used ZYTE_API_SESSION_LOCATION or zyte_api_session_location to configure session initialization with the setLocation action, but Zyte API does not yet support setLocation for the target website.
A custom SessionConfig.check implementation may also close your spider with a custom reason by raising a CloseSpider exception.
Session stats
The following stats exist for scrapy-zyte-api session management:
scrapy-zyte-api/sessions/pools/{pool}/init/check-error
Number of times that a session for pool {pool} triggered an unexpected exception during its session validation check right after initialization.
It is most likely the result of a bad implementation of SessionConfig.check; the logs should contain an error message with a traceback for such errors.
scrapy-zyte-api/sessions/pools/{pool}/init/check-failed
Number of times that a session from pool {pool} failed its session validation check right after initialization.
scrapy-zyte-api/sessions/pools/{pool}/init/check-passed
Number of times that a session from pool {pool} passed its session validation check right after initialization.
scrapy-zyte-api/sessions/pools/{pool}/init/failed
Number of times that initializing a session for pool {pool} resulted in an unsuccessful response.
scrapy-zyte-api/sessions/pools/{pool}/init/param-error
Number of times that initializing a session for pool {pool} triggered an unexpected exception when obtaining the Zyte API parameters for session initialization.
It is most likely the result of a bad implementation of SessionConfig.params; the logs should contain an error message with a traceback for such errors.
scrapy-zyte-api/sessions/pools/{pool}/use/check-error
Number of times that a response that used a session from pool {pool} triggered an unexpected exception during its session validation check.
It is most likely the result of a bad implementation of SessionConfig.check; the logs should contain an error message with a traceback for such errors.
scrapy-zyte-api/sessions/pools/{pool}/use/check-failed
Number of times that a response that used a session from pool {pool} failed its session validation check.
scrapy-zyte-api/sessions/pools/{pool}/use/check-passed
Number of times that a response that used a session from pool {pool} passed its session validation check.
scrapy-zyte-api/sessions/pools/{pool}/use/expired
Number of times that a session from pool {pool} expired.
scrapy-zyte-api/sessions/pools/{pool}/use/failed
Number of times that a request that used a session from pool {pool} got an unsuccessful response.
scrapy-zyte-api/sessions/use/disabled
Number of processed requests for which session management was disabled.
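When reviewing a crawl, it can help to group these stats by pool. The helper below is a hypothetical sketch that works on a plain dictionary of stats, such as the one returned by Scrapy's stats collector (crawler.stats.get_stats()); the function name and the sample values are made up.

```python
from collections import defaultdict

PREFIX = "scrapy-zyte-api/sessions/pools/"


def session_stats_by_pool(stats):
    """Group per-pool session stats as {pool: {"init/failed": n, ...}}."""
    by_pool = defaultdict(dict)
    for key, value in stats.items():
        if not key.startswith(PREFIX):
            continue
        # Key layout after the prefix: {pool}/{phase}/{outcome}. Split from
        # the right so that a pool ID containing slashes is kept whole.
        parts = key[len(PREFIX):].rsplit("/", 2)
        if len(parts) != 3:
            continue
        pool, phase, outcome = parts
        by_pool[pool][f"{phase}/{outcome}"] = value
    return dict(by_pool)


sample = {
    "scrapy-zyte-api/sessions/pools/example.com/init/failed": 2,
    "scrapy-zyte-api/sessions/pools/example.com/use/expired": 5,
    "scrapy-zyte-api/sessions/use/disabled": 1,
    "downloader/request_count": 10,
}
summary = session_stats_by_pool(sample)
```

A pool with a high init/failed or use/check-failed count relative to its check-passed counts is a good candidate for revisiting its initialization parameters or its checker.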