Settings
Settings for scrapy-zyte-api.
ZYTE_API_AUTO_FIELD_STATS
Default: False
Enables stats that indicate which requested fields obtained through scrapy-poet integration come directly from Zyte API automatic extraction.
If for any request no page object class is used to override Zyte API automatic extraction fields for a given item type, the following stat is set:
"scrapy-zyte-api/auto_fields/<item class import path>": "(all fields)"
Note
A literal (all fields) string is used as value, not a list with all fields.
If for any request a custom page object class is used to override some Zyte API automatic extraction fields, the following stat is set:
"scrapy-zyte-api/auto_fields/<override class import path>": (
"<space-separated list of fields not overridden>"
)
Note
zyte_common_items.fields.is_auto_field() is used to determine whether a field has been overridden or not.
ZYTE_API_AUTOMAP_PARAMS
Default: {}
dict of parameters to be combined with automatic request parameters.
These parameters are merged with zyte_api_automap parameters. zyte_api_automap parameters take precedence.
This setting has no effect on requests with manual request parameters.
When using transparent mode, be careful about which parameters you define in this setting. In transparent mode, all Scrapy requests go through Zyte API, even requests that Scrapy sends automatically, such as those for robots.txt files when ROBOTSTXT_OBEY is True, or those for sitemaps when using SitemapSpider. Certain parameters, like browserHtml or screenshot, are not meant to be used for every single request.
If zyte_api_default_params in Request.meta is set to False, this setting is ignored for that request.
See Default parameters.
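As a rough illustration of the precedence rules for ZYTE_API_AUTOMAP_PARAMS, the sketch below models the merge with plain dictionaries. The function name merge_automap_params is hypothetical, and this is not the library's implementation, only a model of the documented behavior: values from the setting act as defaults, zyte_api_automap values from Request.meta override them, and zyte_api_default_params set to False opts the request out.

```python
def merge_automap_params(setting_params, meta):
    """Model how ZYTE_API_AUTOMAP_PARAMS combines with Request.meta (sketch)."""
    request_params = meta.get("zyte_api_automap", {})
    # Assumption: a non-dict value such as True means "no extra parameters".
    if not isinstance(request_params, dict):
        request_params = {}
    # Opting out: the setting is ignored for this request.
    if meta.get("zyte_api_default_params", True) is False:
        return dict(request_params)
    # Later keys win, so per-request parameters take precedence.
    return {**setting_params, **request_params}


setting = {"geolocation": "US"}
print(merge_automap_params(setting, {"zyte_api_automap": {"browserHtml": True}}))
print(merge_automap_params(setting, {"zyte_api_automap": {"geolocation": "DE"}}))
```

Because the setting applies to every automatically mapped request, a model like this also makes it easy to see why a parameter such as browserHtml in the setting would reach robots.txt and sitemap requests in transparent mode.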
ZYTE_API_BROWSER_HEADERS
Default: {"Referer": "referer"}
Determines headers that can be mapped as requestHeaders.
It is a dict where keys are header names and values are the keys that represent them in requestHeaders.
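To make the shape of this mapping concrete, here is a minimal sketch of how such a dict could be applied to a set of request headers. The helper name to_request_headers and the case-insensitive header matching are assumptions made for illustration; the library's actual mapping logic may differ.

```python
ZYTE_API_BROWSER_HEADERS = {"Referer": "referer"}  # the documented default


def to_request_headers(headers, browser_headers=ZYTE_API_BROWSER_HEADERS):
    """Keep only mapped headers, renamed to their requestHeaders key (sketch)."""
    # Assumption: header names are compared case-insensitively.
    lookup = {name.lower(): key for name, key in browser_headers.items()}
    return {
        lookup[name.lower()]: value
        for name, value in headers.items()
        if name.lower() in lookup
    }


headers = {"Referer": "https://example.com/", "User-Agent": "my-bot"}
print(to_request_headers(headers))
```

In this model, headers that have no entry in the setting are simply left out of requestHeaders.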
ZYTE_API_DEFAULT_PARAMS
Default: {}
dict of parameters to be combined with manual request parameters.
You may set zyte_api to an empty dict to only use the parameters defined here for that request.
These parameters are merged with zyte_api parameters. zyte_api parameters take precedence.
This setting has no effect on requests with automatic request parameters.
If zyte_api_default_params in Request.meta is set to False, this setting is ignored for that request.
See Default parameters.
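The three cases for ZYTE_API_DEFAULT_PARAMS (defaults only, per-request override, opt-out) can be sketched with plain dictionaries. The helper effective_params and the sample values are hypothetical, and the code only models the behavior described in this section; it is not taken from the library.

```python
# settings.py (fragment)
ZYTE_API_DEFAULT_PARAMS = {"browserHtml": True, "geolocation": "US"}


def effective_params(meta, defaults=ZYTE_API_DEFAULT_PARAMS):
    """Model the parameters sent for a request with manual parameters (sketch)."""
    manual = meta["zyte_api"]
    if meta.get("zyte_api_default_params", True) is False:
        return dict(manual)  # setting ignored for this request
    return {**defaults, **manual}  # zyte_api parameters take precedence


# An empty dict means "use only the defaults".
print(effective_params({"zyte_api": {}}))
# A per-request value overrides the default with the same key.
print(effective_params({"zyte_api": {"geolocation": "DE"}}))
# Opting out drops the defaults entirely.
print(effective_params({"zyte_api": {"httpResponseBody": True}, "zyte_api_default_params": False}))
```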
ZYTE_API_ENABLED
Default: True
Can be set to False to disable scrapy-zyte-api.
ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS
Default: scrapy_poet.ScrapyPoetRequestFingerprinter if scrapy-poet is installed, else scrapy.utils.request.RequestFingerprinter
Request fingerprinter to use for requests that do not go through Zyte API. See Request fingerprinting.
ZYTE_API_KEY
Default: None
Your Zyte API key.
You can alternatively define an environment variable with the same name.
Tip
On Scrapy Cloud, this setting is defined automatically.
ZYTE_API_LOG_REQUESTS
Default: False
Set this to True and LOG_LEVEL to "DEBUG" to enable the logging of debug messages that indicate the JSON object sent on every Zyte API request.
For example:
Sending Zyte API extract request: {"url": "https://example.com", "httpResponseBody": true}
See also: ZYTE_API_LOG_REQUESTS_TRUNCATE.
ZYTE_API_LOG_REQUESTS_TRUNCATE
Default: 64
Determines the maximum length of any string value in the JSON object logged when ZYTE_API_LOG_REQUESTS is enabled, excluding object keys.
To disable truncation, set this to 0.
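Putting the two logging settings together, a settings.py fragment for full, untruncated request logging might look like the following. The values shown are one possible choice for a debugging session, not a recommendation.

```python
# settings.py (fragment)

# Log the JSON object sent on every Zyte API request...
ZYTE_API_LOG_REQUESTS = True
# ...which only shows up when debug messages are logged at all.
LOG_LEVEL = "DEBUG"
# 0 disables truncation of string values; the default is 64.
ZYTE_API_LOG_REQUESTS_TRUNCATE = 0
```

Untruncated logs can grow quickly when requests carry long string values, so restoring the default afterwards is probably wise.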
ZYTE_API_MAX_REQUESTS
Default: None
When set to an integer value > 0, the spider will close when the number of Zyte API requests reaches it, with closespider_max_zapi_requests as the close reason.
Note that requests with error responses that cannot be retried or exceed their retry limit also count here.
ZYTE_API_PRESERVE_DELAY
Default: False if AUTOTHROTTLE_ENABLED else True
By default, requests for which use of scrapy-zyte-api is enabled get zyte-api@ prepended to their download slot ID, and if AUTOTHROTTLE_ENABLED is True, the corresponding download slot gets its download delay reset to 0. This nullifies the effects of the AutoThrottle extension for Zyte API requests, delegating throttling management to Zyte API.
If AUTOTHROTTLE_ENABLED is False, but you have a download delay set through DOWNLOAD_DELAY and you do not want that delay to affect Zyte API requests, set this setting to False.
If you have AUTOTHROTTLE_ENABLED enabled, and you want it to also work on Zyte API requests, set this setting to True.
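The interaction between ZYTE_API_PRESERVE_DELAY and AUTOTHROTTLE_ENABLED can be summarized in a small model. The function zyte_api_slot below is hypothetical and works on a plain dict of settings; it only mirrors the behavior described above and is not the library's code.

```python
def zyte_api_slot(slot_id, delay, settings):
    """Return the (slot ID, download delay) a Zyte API request would get (sketch)."""
    autothrottle = settings.get("AUTOTHROTTLE_ENABLED", False)
    # Documented default: False if AUTOTHROTTLE_ENABLED else True.
    preserve = settings.get("ZYTE_API_PRESERVE_DELAY", not autothrottle)
    # The slot ID always gets the zyte-api@ prefix; the delay is kept
    # only when the delay is preserved, otherwise it is reset to 0.
    return f"zyte-api@{slot_id}", (delay if preserve else 0.0)


print(zyte_api_slot("example.com", 2.0, {"AUTOTHROTTLE_ENABLED": True}))
print(zyte_api_slot("example.com", 2.0, {"ZYTE_API_PRESERVE_DELAY": False}))
```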
ZYTE_API_PROVIDER_PARAMS
Default: {}
Defines additional request parameters to use in Zyte API requests sent by the scrapy-poet integration.
For example:
ZYTE_API_PROVIDER_PARAMS = {
    "requestCookies": [
        {"name": "a", "value": "b", "domain": "example.com"},
    ],
}
ZYTE_API_RETRY_POLICY
Default: "zyte_api.aio.retry.zyte_api_retrying"
Determines the retry policy for Zyte API requests.
It must be a string with the import path of a tenacity.AsyncRetrying subclass.
Note
Settings must be picklable, and retry policies are not, so you cannot assign a retry policy class directly to this setting; you must use its import path as a string instead.
See Retries.
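To illustrate why an import path works where an object would not, the sketch below shows a string setting surviving pickling and then being resolved to an object at run time. The path myproject.retry_policies.CUSTOM_RETRY_POLICY is hypothetical, and load_object here is a stdlib stand-in written for this example (Scrapy ships its own helper, scrapy.utils.misc.load_object); json.dumps is used only because it is importable everywhere.

```python
import json
import pickle
from importlib import import_module

# settings.py (fragment); the module and object names are hypothetical.
ZYTE_API_RETRY_POLICY = "myproject.retry_policies.CUSTOM_RETRY_POLICY"


def load_object(path):
    """Resolve a dotted import path such as 'package.module.name' (sketch)."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)


# A plain string round-trips through pickle without trouble.
restored = pickle.loads(pickle.dumps(ZYTE_API_RETRY_POLICY))
print(restored == ZYTE_API_RETRY_POLICY)

# Resolving a path that exists in the standard library, for demonstration.
print(load_object("json.dumps") is json.dumps)
```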
ZYTE_API_SESSION_CHECKER
Default: None
A Scrapy component (or its import path as a string) that defines a check method.
If check returns True, the response session is considered valid; if check returns False, the response session is considered invalid, and will be discarded. check can also raise a CloseSpider exception to close the spider.
If defined, the check method is called on every response that is using a session managed by scrapy-zyte-api. If not defined, the default implementation checks the outcome of the setLocation action if session initialization was location-based, as described in Checking sessions.
Example:
from scrapy import Request
from scrapy.http.response import Response

class MySessionChecker:
    def check(self, response: Response, request: Request) -> bool:
        return bool(response.css(".is_valid"))

ZYTE_API_SESSION_CHECKER = MySessionChecker
Because the session checker is a Scrapy component, you can access the crawler object, for example to read settings:
from scrapy import Request
from scrapy.http.response import Response

class MySessionChecker:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        location = crawler.settings["ZYTE_API_SESSION_LOCATION"]
        self.postal_code = location["postalCode"]

    def check(self, response: Response, request: Request) -> bool:
        return response.css(".postal_code::text").get() == self.postal_code

ZYTE_API_SESSION_CHECKER = MySessionChecker
ZYTE_API_SESSION_ENABLED
Default: False
Enables scrapy-zyte-api session management.
ZYTE_API_SESSION_LOCATION
Default: {}
See Initializing sessions for general information about location configuration and parameter precedence.
Example:
ZYTE_API_SESSION_LOCATION = {"postalCode": "10001"}
ZYTE_API_SESSION_MAX_BAD_INITS
Default: 8
The maximum number of consecutive scrapy-zyte-api sessions per pool that are allowed to fail their session check right after creation. If the maximum is reached, the spider closes with bad_session_inits as the close reason.
To override this value for specific pools, use ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL.
ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL
Default: {}
dict where keys are pool IDs and values are overrides of ZYTE_API_SESSION_MAX_BAD_INITS for those pools.
ZYTE_API_SESSION_MAX_ERRORS
Default: 1
Maximum number of unsuccessful responses allowed for any given session before discarding the session.
You might want to increase this number if you find that a session may continue to work even after an unsuccessful response. See Optimizing sessions.
Note
This setting does not affect session checks (ZYTE_API_SESSION_CHECKER). A session is always discarded the first time it fails its session check.
ZYTE_API_SESSION_PARAMS
Default: {}
See Initializing sessions for general information about defining session initialization parameters and parameter precedence.
Example:
ZYTE_API_SESSION_PARAMS = {
    "browserHtml": True,
    "actions": [
        {
            "action": "setLocation",
            "address": {"postalCode": "10001"},
        }
    ],
}
Tip
The example above is equivalent to setting ZYTE_API_SESSION_LOCATION to {"postalCode": "10001"}.
ZYTE_API_SESSION_POOL_SIZE
Default: 8
The maximum number of active scrapy-zyte-api sessions to keep per pool.
To override this value for specific pools, use ZYTE_API_SESSION_POOL_SIZES.
Increase this number to lower the frequency with which requests are sent through each session, which on some websites may increase the lifetime of each session. See Optimizing sessions.
ZYTE_API_SESSION_POOL_SIZES
Default: {}
dict where keys are pool IDs and values are overrides of ZYTE_API_SESSION_POOL_SIZE for those pools.
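A per-pool override of this kind amounts to a dictionary lookup with a fallback to the global value. The sketch below assumes two made-up pool IDs and a hypothetical helper, pool_size; it illustrates the relationship between the two settings rather than the library's internals.

```python
# settings.py (fragment); the pool IDs below are hypothetical.
ZYTE_API_SESSION_POOL_SIZE = 8
ZYTE_API_SESSION_POOL_SIZES = {
    "example.com": 16,  # a larger pool, so each session is used less often
    "slow.example": 2,  # a smaller pool for a site crawled lightly
}


def pool_size(pool_id, default=ZYTE_API_SESSION_POOL_SIZE, overrides=ZYTE_API_SESSION_POOL_SIZES):
    """Return the per-pool override if there is one, else the global size (sketch)."""
    return overrides.get(pool_id, default)


print(pool_size("example.com"))
print(pool_size("other.example"))
```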
ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS
Default: 60
scrapy-zyte-api maintains a rotation queue of ready-to-use sessions per pool. At times, the queue might be empty for a given pool because all its sessions are in the process of being initialized or refreshed.
If the queue is empty when trying to assign a session to a request, scrapy-zyte-api will wait some time (ZYTE_API_SESSION_QUEUE_WAIT_TIME), and then try to get a session from the queue again.
Use this setting to configure the maximum number of attempts before giving up and raising a RuntimeError exception.
ZYTE_API_SESSION_QUEUE_WAIT_TIME
Default: 1.0
Number of seconds to wait between attempts to get a session from a rotation queue.
See ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS for details.
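The two queue settings combine into a simple poll-and-wait loop. The following asyncio sketch models that loop with a deque standing in for the rotation queue; the function name next_session, the error message, and the decision not to sleep after the final attempt are all assumptions made for illustration, not details of the library.

```python
import asyncio
from collections import deque


async def next_session(queue, max_attempts=60, wait_time=1.0):
    """Poll a session queue, waiting between attempts (sketch)."""
    for attempt in range(1, max_attempts + 1):
        if queue:
            return queue.popleft()
        # Assumption: no point in waiting after the last attempt.
        if attempt < max_attempts:
            await asyncio.sleep(wait_time)
    raise RuntimeError(f"No session available after {max_attempts} attempts")


# A ready session is returned immediately.
print(asyncio.run(next_session(deque(["session-1"]))))
```

With the documented defaults (60 attempts, 1.0 seconds between them), a loop like this gives up after roughly one minute of an empty queue.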
ZYTE_API_SKIP_HEADERS
Default: ["Cookie"]
Determines headers that must not be mapped as customHttpRequestHeaders.
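As an illustration, the sketch below filters a set of request headers against this setting and produces a list of name/value objects, the shape customHttpRequestHeaders takes in Zyte API. The helper name to_custom_http_request_headers and the case-insensitive comparison are assumptions; the library's actual mapping may apply further rules.

```python
ZYTE_API_SKIP_HEADERS = ["Cookie"]  # the documented default


def to_custom_http_request_headers(headers, skip_headers=ZYTE_API_SKIP_HEADERS):
    """Drop skipped headers and build name/value pairs (sketch)."""
    # Assumption: header names are compared case-insensitively.
    skipped = {name.lower() for name in skip_headers}
    return [
        {"name": name, "value": value}
        for name, value in headers.items()
        if name.lower() not in skipped
    ]


headers = {"Accept-Language": "en", "Cookie": "a=b"}
print(to_custom_http_request_headers(headers))
```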
ZYTE_API_TRANSPARENT_MODE
Default: False
See Transparent mode.
ZYTE_API_USE_ENV_PROXY
Default: False
Set to True to make Zyte API requests respect system proxy settings. See Using a proxy.