Contents¶
Overview¶
A Python REST API client for internetarchive/heritrix3.
Free software: MIT license
Installation¶
pip install heritrix3
You can also install the in-development version with:
pip install https://github.com/Querela/python-heritrix3-client/archive/master.zip
Documentation¶
Development¶
To run all the tests, run:
tox
Note: to combine the coverage data from all the tox environments, run:
Windows:
set PYTEST_ADDOPTS=--cov-append
tox
Other:
PYTEST_ADDOPTS=--cov-append tox
Usage¶
To use Heritrix3 Client in a project:
import heritrix3
A basic workflow follows:
# basic imports
from pathlib import Path
from pprint import pprint
from heritrix3 import disable_ssl_warnings
from heritrix3 import HeritrixAPI
# disable insecure requests warning
disable_ssl_warnings()
# create the REST API client
api = HeritrixAPI(host="https://localhost:8443/engine", user="admin", passwd="admin", verbose=True)
# dump info
pprint(api.info(raw=False))
# similar to info (output wise)
pprint(api.rescan(raw=False))
# alternatively, convert the raw XML response manually
pprint(HeritrixAPI._xml2json(api.rescan(raw=True).text))
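Hard-coded credentials as above are fine for a local test instance. As a minimal sketch, the connection settings could instead be read from environment variables and passed to HeritrixAPI as keyword arguments. The variable names HERITRIX_HOST, HERITRIX_USER and HERITRIX_PASSWD and the helper client_kwargs_from_env are hypothetical and not part of the library; the keyword names host, user and passwd and their defaults are taken from the HeritrixAPI signature in the Reference section.

```python
import os
from typing import Dict, Mapping, Optional


def client_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect HeritrixAPI keyword arguments from environment variables.

    The HERITRIX_* variable names are an assumption made for this sketch.
    """
    if env is None:
        env = os.environ
    return {
        "host": env.get("HERITRIX_HOST", "https://localhost:8443/engine"),
        "user": env.get("HERITRIX_USER", "admin"),
        "passwd": env.get("HERITRIX_PASSWD", "admin"),
    }


# Usage (requires the heritrix3 package and a reachable Heritrix instance):
# from heritrix3 import HeritrixAPI
# api = HeritrixAPI(**client_kwargs_from_env())
```

Passing an explicit mapping instead of relying on os.environ keeps the helper easy to test.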
How to work with jobs:
# create job (if it exists, it will not do anything?)
jobname = "test"
pprint(api.create(jobname))
assert jobname in api.list_jobs()
# list all jobs
api.list_jobs()
# and their actions
api.get_job_actions(jobname)
# send a config file (that allows a separate seeds.txt file)
p = (Path.cwd() / "..").resolve() / "examples" / "crawler-beans.seed_file.cxml"
api.send_config(jobname, p)
# create + send seeds
p = (Path.cwd() / "..").resolve() / "examples" / "seeds.txt"
p.write_text("https://www.google.com/\n")
api.send_file(jobname, p) # or with "seeds.txt" as third param
# build job (required for some functions, like script execution)
pprint(api.build(jobname))
# can be used to wait until an action is available
# might block indefinitely if this action does not exist or never becomes available
api.wait_for_action(jobname, "launch")
# launch the job
pprint(api.launch(jobname))
# pause a job
pprint(api.pause(jobname))
# checkpoint
pprint(api.checkpoint(jobname))
# unpause a job
pprint(api.unpause(jobname))
# terminate the job
pprint(api.terminate(jobname))
# unbuild/teardown the job
pprint(api.teardown(jobname))
# NOTE: the following requires the job to be built! (so no teardown)
# clean up the job (all files are gone)
# NOTE: you should be careful that the job is not still running
api.delete_job_dir(jobname)
pprint(api.rescan())
assert jobname not in api.list_jobs()
See the official Heritrix REST API docs.
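The job steps above leave a built job behind if one of the calls fails halfway. The following is a minimal sketch of wrapping them so that the job is always torn down; the helper name crawl_with_cleanup is hypothetical, and api is assumed to behave like a HeritrixAPI instance. Only methods that appear in the workflow above are called.

```python
from typing import Any


def crawl_with_cleanup(api: Any, jobname: str) -> bool:
    """Build and launch a job, then always tear it down again.

    Returns True if the job was launched, False if the "launch" action
    never became available.
    """
    api.build(jobname)
    launched = False
    try:
        # wait_for_action returns a bool according to the reference signature
        if not api.wait_for_action(jobname, "launch"):
            return False
        api.launch(jobname)
        launched = True
        # ... pause / checkpoint / unpause calls would go here ...
        return True
    finally:
        # runs on success, on early return and when a call above raised
        if launched:
            api.terminate(jobname)
        api.teardown(jobname)
```

A real script would probably also wait for the job to finish terminating before the teardown, for example with another wait_for_action call; that step is left out here to keep the sketch short.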
Show job information:
job_info_dict = api.info(jobname)
job_xml_txt = api.info(jobname, raw=True).text
config_xml_txt = api.get_config(jobname)
# crawl report (plain text)
launchid = None # "latest"
report_txt = api.crawl_report(jobname, launchid)
# the following functions require the job to be built
# list the job's files
pprint(api.list_files(jobname))
# show the warcs (after pause/terminate)
pprint(api.list_warcs(jobname))
# launch id
launchid = api.get_launchid(jobname)
If you require a basic heritrix setup, you may use the ekoerner/heritrix Docker image.
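The raw XML returned by api.info(jobname, raw=True).text can also be inspected with the standard library. The sketch below pulls the controller state and the available actions out of a job document. The sample XML and its element names (crawlControllerState, availableActions, value) are assumptions based on the Heritrix REST API documentation, not output captured from a live instance, and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative sample only; element names are assumed, not taken from a live response.
SAMPLE_JOB_XML = """
<job>
  <shortName>test</shortName>
  <crawlControllerState>PAUSED</crawlControllerState>
  <availableActions>
    <value>unpause</value>
    <value>checkpoint</value>
    <value>terminate</value>
  </availableActions>
</job>
"""


def available_actions(job_xml_txt: str) -> List[str]:
    """Return the text of every <availableActions>/<value> element."""
    root = ET.fromstring(job_xml_txt.strip())
    return [el.text or "" for el in root.findall("./availableActions/value")]


def controller_state(job_xml_txt: str) -> str:
    """Return the text of <crawlControllerState>, or "" if it is missing."""
    root = ET.fromstring(job_xml_txt.strip())
    return root.findtext("crawlControllerState", default="")


print(controller_state(SAMPLE_JOB_XML), available_actions(SAMPLE_JOB_XML))
```

With a real client, job_xml_txt from the snippet above would take the place of SAMPLE_JOB_XML.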
CLI¶
The Heritrix3 client library also provides a command-line utility named heritrix3:
heritrix3 --help
# configure your heritrix REST endpoint:
heritrix3 --host https://localhost:8443/engine --username admin --password admin
# interactive python shell
heritrix3 shell
# list jobs, actions
heritrix3 list-jobs
heritrix3 list-jobs-actions
# show info
heritrix3 info
# show job info for "test"
heritrix3 info test
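When the CLI is driven from another Python script, it can help to build the argument list in one place. This is a minimal sketch; the helper heritrix3_argv is hypothetical. It relies only on the option names shown above (--host, --username, --password) and places them before the subcommand, following the usage line heritrix3 [OPTIONS] COMMAND [ARGS]... in the reference.

```python
import shlex
from typing import List, Sequence


def heritrix3_argv(
    command: Sequence[str],
    host: str = "https://localhost:8443/engine",
    username: str = "admin",
    password: str = "admin",
) -> List[str]:
    """Build an argument list: global options first, then the subcommand."""
    return [
        "heritrix3",
        "--host", host,
        "--username", username,
        "--password", password,
        *command,
    ]


argv = heritrix3_argv(["info", "test"])
print(shlex.join(argv))  # shell-quoted form, useful for logging

# To actually run it (needs the heritrix3 CLI installed and a running Heritrix):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with passwords that contain special characters.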
Reference¶
heritrix3¶
heritrix3.api¶
exception heritrix3.api.HeritrixAPIError(message: str, *args, **kwargs)
Error as response from Heritrix3 REST API.
Parameters
message (str) – Error description / message.
response (Optional[requests.Response]) – Optional api response object.
class heritrix3.api.HeritrixAPI(host: str = 'https://localhost:8443/engine', user: str = 'admin', passwd: str = 'admin', verbose: bool = False, insecure: bool = True, headers: Optional[Dict[str, str]] = None, timeout: Optional[Union[int, float]] = None)
    retrieve_file(job_name: str, local_filepath: os.PathLike, job_filepath: Union[str, os.PathLike], overwrite: bool = False) → bool
    info(job_name: Optional[str] = None, raw: bool = False) → Union[str, requests.models.Response]
    wait_for_action(job_name: str, action: str, timeout: Union[int, float] = 20, poll_delay: Union[int, float] = 1) → bool
    wait_for_jobstate(job_name: str, state: str, timeout: Union[int, float] = 20, poll_delay: Union[int, float] = 1) → bool
    copy(job_name: str, new_job_name: str, as_profile: bool = False, raw: bool = False) → Union[str, requests.models.Response]
    launch(job_name: str, checkpoint: Optional[str] = None, raw: bool = False) → Union[str, requests.models.Response]
    execute_script(job_name: str, script: str, engine: str = 'beanshell', raw: bool = False) → Union[str, requests.models.Response]
    list_files(job_name: str, gather_files: bool = True, gather_folders: bool = True) → List[str]
heritrix3.api.disable_ssl_warnings()
Quieten SSL insecure warnings.
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
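wait_for_action and wait_for_jobstate both take timeout and poll_delay parameters and return a bool. The following sketch shows how a polling wait with such parameters could behave. It is a generic illustration, not the library's implementation; the helper wait_until is hypothetical, and whether the library interprets timeout in exactly this way is an assumption.

```python
import time
from typing import Callable, Union

Number = Union[int, float]


def wait_until(check: Callable[[], bool], timeout: Number = 20, poll_delay: Number = 1) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False  # gave up: the condition never became true in time
        time.sleep(poll_delay)


# Possible use with a client, assuming get_job_actions returns a collection
# of action names:
# wait_until(lambda: "launch" in api.get_job_actions("test"), timeout=30)
```

Callers should check the returned bool rather than assume the condition was met.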
heritrix3.cli¶
heritrix3¶
CLI for the Heritrix API.
heritrix3 [OPTIONS] COMMAND [ARGS]...
Options
-h, --host <host>
Heritrix base URI
-u, --username <username>
HTTP Digest Username
Default: admin
-p, --password <password>
HTTP Digest Password
Default: admin
--version
Show the version and exit.
info¶
Show information about all jobs or a single job.
If given a jobname as argument then only display information about this job.
Tries to use pygments to colorize the output.
heritrix3 info [OPTIONS] [JOBNAME]
Options
--raw
Output plain XML response.
Arguments
JOBNAME
Optional argument
list-jobs¶
List jobs, allow filtering for unbuilt ones.
heritrix3 list-jobs [OPTIONS]
Options
--unbuilt
--sorted
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
Bug reports¶
When reporting a bug please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Documentation improvements¶
Heritrix3 Client could always use more documentation, whether as part of the official Heritrix3 Client docs, in docstrings, or even on the web in blog posts, articles, and such.
Feature requests and feedback¶
The best way to send feedback is to file an issue at https://github.com/Querela/python-heritrix3-client/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that code contributions are welcome :)
Development¶
To set up python-heritrix3-client for local development:
Fork python-heritrix3-client (look for the “Fork” button).
Clone your fork locally:
git clone git@github.com:YOURGITHUBNAME/python-heritrix3-client.git
Create a branch for local development:
git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, run all the checks and the docs builder with one tox command:
tox
Commit your changes and push your branch to GitHub:
git add .
git commit -m "Your detailed description of your changes."
git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines¶
If you need code review or feedback while you’re developing the code, just make the pull request.
For merging, you should:
Include passing tests (run tox).
Update documentation when there’s new API, functionality etc.
Add a note to CHANGELOG.rst about the changes.
Add yourself to AUTHORS.rst.
Tips¶
To run a subset of tests:
tox -e envname -- pytest -k test_myfeature
To run all the test environments in parallel:
tox -p auto
Authors¶
Erik Körner - koerner@informatik.uni-leipzig.de
Changelog¶
WIP¶
Tests using real Heritrix? (Coverage?)
Refactoring common code fragments.
Documentation (docstrings).
0.4.0 (2021-01-11)¶
Reorder api functions.
Add log retrieval methods.
Add job state check + wait_for methods.
0.2.0 (2021-01-09)¶
Typings.
Add file download (e.g. all WARCs).
Add report retrieval.
0.1.0 (2021-01-09)¶
First release on PyPI.
Initial implementation and documentation.
0.0.0 (2021-01-09)¶
Code skeleton using cookiecutter gh:ionelmc/cookiecutter-pylibrary