heritrix3.api

exception heritrix3.api.HeritrixAPIError(message: str, *args, **kwargs)

Raised when the Heritrix3 REST API responds with an error.

Parameters
  • message (str) – Error description / message.

  • response (Optional[requests.Response]) – Optional API response object.

class heritrix3.api.HeritrixAPI(host: str = 'https://localhost:8443/engine', user: str = 'admin', passwd: str = 'admin', verbose: bool = False, insecure: bool = True, headers: Optional[Dict[str, str]] = None, timeout: Optional[Union[int, float]] = None)

  send_file(job_name: str, filepath: os.PathLike, name: Optional[str] = None) → bool
  send_content(job_name: str, filecontent: Union[bytes, BinaryIO], name: str) → bool
  retrieve_file(job_name: str, local_filepath: os.PathLike, job_filepath: Union[str, os.PathLike], overwrite: bool = False) → bool
  info(job_name: Optional[str] = None, raw: bool = False) → Union[str, requests.models.Response]
  list_jobs(status: Optional[str] = None, unbuilt: bool = False) → List[str]
  get_job_state(job_name: str) → Optional[str]
  get_crawl_exit_state(job_name: str) → Optional[str]
  get_job_actions(job_name: str) → List[str]
  wait_for_action(job_name: str, action: str, timeout: Union[int, float] = 20, poll_delay: Union[int, float] = 1) → bool
  wait_for_jobstate(job_name: str, state: str, timeout: Union[int, float] = 20, poll_delay: Union[int, float] = 1) → bool
  create(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  add(job_dir: str, raw: bool = False) → Union[str, requests.models.Response]
  rescan(raw: bool = False) → Union[str, requests.models.Response]
  copy(job_name: str, new_job_name: str, as_profile: bool = False, raw: bool = False) → Union[str, requests.models.Response]
  build(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  launch(job_name: str, checkpoint: Optional[str] = None, raw: bool = False) → Union[str, requests.models.Response]
  pause(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  unpause(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  terminate(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  teardown(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  checkpoint(job_name: str, raw: bool = False) → Union[str, requests.models.Response]
  execute_script(job_name: str, script: str, engine: str = 'beanshell', raw: bool = False) → Union[str, requests.models.Response]
  get_config(job_name: str, raw: bool = True) → str
  send_config(job_name: str, cxml_filepath: os.PathLike) → bool
  get_config_url(job_name: str) → str
  get_launchid(job_name: str) → Optional[str]
  crawl_report(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  seeds_report(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  hosts_report(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  mimetypes_report(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  responsecodes_report(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  job_log(job_name: str) → Optional[str]
  crawl_log(job_name: str, launch_id: Optional[str] = None) → Optional[str]
  list_files(job_name: str, gather_files: bool = True, gather_folders: bool = True) → List[str]
  list_warcs(job_name: str, launchid: Optional[str] = None) → Optional[List[str]]
  retrieve_warcs(job_name: str, local_folderpath: os.PathLike, launchid: Optional[str] = None, warcs_job_filepaths: Optional[List[Union[str, os.PathLike]]] = None, overwrite: bool = False) → Optional[int]
  delete_job_dir(job_name: str) → None

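
The job-control methods listed above can be chained into a simple workflow. The following is a minimal sketch rather than the library's prescribed usage: the helper name run_crawl is hypothetical, and both the order of calls and the action names "launch" and "unpause" are assumptions based on Heritrix 3's usual job lifecycle, not something this reference states. The helper accepts any object that exposes the methods used, so it can be tried against a stub before pointing it at a real engine.

```python
from typing import Any, List


def run_crawl(api: Any, job_name: str, timeout: float = 20) -> List[str]:
    """Drive a job through create, build, launch and unpause.

    `api` is expected to be a heritrix3.api.HeritrixAPI instance, or any
    object with the same create/build/launch/unpause/wait_for_action methods.
    Returns the names of the steps that were actually carried out.
    """
    steps: List[str] = []

    api.create(job_name)
    steps.append("create")

    api.build(job_name)
    steps.append("build")

    # Assumption: "launch" becomes an available action once the build is done.
    if not api.wait_for_action(job_name, "launch", timeout=timeout):
        return steps

    api.launch(job_name)
    steps.append("launch")

    # Assumption: a freshly launched job starts paused and then offers "unpause".
    if api.wait_for_action(job_name, "unpause", timeout=timeout):
        api.unpause(job_name)
        steps.append("unpause")

    return steps
```

Against a running engine with the default constructor arguments, this would be called as run_crawl(HeritrixAPI(), "myjob"). Returning the list of completed steps, instead of raising, makes it easy to see where a job stalled when a wait times out.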
heritrix3.api.disable_ssl_warnings()

Silence the warnings emitted for insecure (unverified) SSL requests.

See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
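
How disable_ssl_warnings() works internally is not shown in this reference. A comparable effect can be sketched with Python's standard warnings module, assuming the warning of interest is the one urllib3 emits with a message beginning "Unverified HTTPS request". The helper name below is hypothetical and is not part of heritrix3.

```python
import warnings


def ignore_unverified_https_warnings() -> None:
    """Hypothetical helper, not part of heritrix3.

    Installs a filter that ignores any warning whose message starts with
    'Unverified HTTPS request' (assumed to be urllib3's wording for
    requests made without certificate verification).
    """
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")
```

Filtering on the message text leaves every other warning visible, which is usually preferable to silencing warnings globally.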