

# Auth route for scrapers
(Find this issue with `git grep YNQAQKJS`)
## Test curl commands
(In production the API key should be loaded from a file. Putting it in the
Bash command is bad, because it will be saved to Bash's history file. Putting
it in an environment variable is slightly better.)
`curl -H "X-ApiKey: $API_KEY"`
Should return "You're valid!"
`curl -H "X-ApiKey: $API_KEY"`
Should return a JSON object listing all the servers.
`curl -H "X-ApiKey: $API_KEY"`
Proxies into the "aliens_wildland" server and retrieves a JSON object listing
the file server root. (The server must be running a new version of ptth_server
which can serve the JSON API)
`curl -H "X-ApiKey: $API_KEY"`
Same, but retrieves the listing for "/src".
`curl -H "X-ApiKey: $API_KEY"`
There is no special API for retrieving files yet, but the existing server
API is proxied through the new scraper API on the relay.
`curl --head -H "X-ApiKey: $API_KEY"`
PTTH supports HEAD requests. This request will yield a "204 No Content", with
the "content-length" header.
`curl -H "range: bytes=100-199" -H "X-ApiKey: $API_KEY"`
PTTH supports byte range requests. This request will skip 100 bytes into the
file, and read 100 bytes.
To avoid fence-post errors, most programming languages use half-open ranges,
e.g. `0..3` means "0, 1, 2". However, HTTP byte ranges are closed ranges,
e.g. `bytes=0-3` means "0, 1, 2, 3". So 100-199 means 199 is the last byte retrieved.
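
The conversion between the two conventions is easy to get wrong by one, so here is a minimal sketch of it. The function names are made up for illustration and are not part of PTTH.

```python
def http_range_header(start: int, stop: int) -> str:
    """Convert a half-open range [start, stop) into an HTTP Range header value."""
    if start < 0 or stop <= start:
        raise ValueError("need 0 <= start < stop")
    # HTTP's last-byte position is inclusive, so subtract 1 from the
    # exclusive `stop`.
    return f"bytes={start}-{stop - 1}"


def range_length(first: int, last: int) -> int:
    """Number of bytes covered by the closed HTTP range first-last."""
    return last - first + 1
```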
By polling with HEAD and byte range requests, a scraper client can approximate
`tail -f` behavior of a server-side file.
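
A rough sketch of that polling loop, in Python. The `head` and `get_range` callables are hypothetical stand-ins for the HEAD and ranged GET requests shown above (they are not PTTH APIs), so the sketch stays independent of any particular URL or HTTP library. It assumes the file only grows; truncation and log rotation are not handled.

```python
from typing import Callable, Optional, Tuple


def next_range(offset: int, content_length: int) -> Optional[str]:
    """Range header value for the bytes not yet seen, or None if there are none."""
    if content_length <= offset:
        # Nothing new. (A shrinking file also lands here; this sketch
        # does not try to detect truncation.)
        return None
    # HTTP ranges are closed, so the last byte index is content_length - 1.
    return f"bytes={offset}-{content_length - 1}"


def poll_once(
    offset: int,
    head: Callable[[], int],
    get_range: Callable[[str], bytes],
) -> Tuple[bytes, int]:
    """Do one poll: HEAD for the current length, then fetch only the new bytes.

    `head` returns the file's content-length; `get_range` takes a Range
    header value and returns the body. Returns (new_bytes, new_offset).
    """
    header = next_range(offset, head())
    if header is None:
        return b"", offset
    chunk = get_range(header)
    return chunk, offset + len(chunk)
```

A scraper would call `poll_once` in a loop with a sleep between iterations, carrying the returned offset forward each time.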
## Problem statement
PTTH has 2 auth routes:
- A fixed API key for servers
- Whatever the end user puts in front of the HTML client
"Whatever" is hard for scrapers to deal with. This barrier to scraping
is blocking these issues:
- EOTPXGR3 Remote `tail -f`
- UPAQ3ZPT Audit logging of the relay itself
- YMFMSV2R Add Prometheus metrics
## Proposal
Add a 3rd auth route meeting these criteria:
- Enabled by a feature flag, disabled by default
- Bootstrapped by the user-friendly HTML frontend
- Suitable for headless automated scrapers
It will probably involve an API key like the servers use. Public-key
crypto is stronger, but involves more work. I think we should plan to
start with something weak, and also plan to deprecate it once something
stronger is ready.
## Proposed impl plan
- (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
- (X) Make sure Docker release CAN build
- (X) Add hash of 1 scraper key to ptth_relay.toml, with 1 week expiration
- (X) Accept scraper key for some testing endpoint
- (X) (POC) Test with curl
- (X) Clean up scraper endpoint
- (X) Add (almost) end-to-end tests for test scraper endpoint
- (X) Thread server endpoints through relay scraper auth
- ( ) Add tests for other scraper endpoints
- (don't care) Factor v1 API into v1 module
- (X) Add real scraper endpoints
- ( ) Manually create SQLite DB for scraper keys, add 1 hash
- ( ) Impl DB reads
- ( ) Remove scraper key from config file
- ( ) Make sure `cargo test` passes and Docker CAN build
- ( ) (MVP) Test with curl
- ( ) Impl and test DB init / migration
- ( ) Impl DB writes (Add / revoke keys) as CLI commands
- ( ) Implement API (Behind X-Email auth) for that, test with curl
- ( ) Set up mitmproxy or something to add X-Email header in dev env
- ( ) Implement web UI (Behind X-Email)
POC is the proof-of-concept - At this point we will know that in theory the
feature can work.
MVP is the first deployable version - I could put it in prod, manually fudge
the SQLite DB to add a 1-month key, and let people start building scrapers.
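
To make the "hash of 1 scraper key, with expiration" steps concrete, here is a minimal sketch of the check the relay would do. It is written in Python for brevity and is not the relay's actual code. SHA-256 is an assumption, since this document does not say which hash is used, and `hash_key` / `key_is_valid` are hypothetical names.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone


def hash_key(key: str) -> str:
    """Hash a scraper key so only the hash has to be stored.

    SHA-256 is an assumption made for this sketch.
    """
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def key_is_valid(
    presented_key: str,
    stored_hash: str,
    not_after: datetime,
    now: datetime,
) -> bool:
    """True if the presented key matches the stored hash and has not expired."""
    if now >= not_after:
        return False
    # Constant-time comparison, so the check does not leak how many
    # leading characters of the hash matched.
    return hmac.compare_digest(hash_key(presented_key), stored_hash)


# Example: a key issued with a 1-week expiration.
issued = datetime(2020, 12, 12, tzinfo=timezone.utc)
expires = issued + timedelta(days=7)
stored = hash_key("example-key")
```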
Dev mode will allow anonymous users to generate scraper keys. In prod mode
(the default), clients will need to have the X-Email header set or use a
scraper key to do anything.
Design the DB so that the servers can share it one day.
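
One possible shape for that DB, sketched with Python's `sqlite3` module against an in-memory database. Every table and column name here is hypothetical and nothing is decided. The `kind` column is one way to leave room for server keys later, and timestamps are stored as fixed-format ISO 8601 UTC strings so they compare correctly as text.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS api_keys (
    id         INTEGER PRIMARY KEY,
    kind       TEXT NOT NULL,        -- 'scraper' now, maybe 'server' one day
    key_hash   TEXT NOT NULL UNIQUE, -- never store the key itself
    not_before TEXT NOT NULL,        -- e.g. '2020-12-12T00:00:00Z'
    not_after  TEXT NOT NULL,
    revoked    INTEGER NOT NULL DEFAULT 0
);
"""


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def add_key(conn, kind: str, key_hash: str, not_before: str, not_after: str) -> None:
    conn.execute(
        "INSERT INTO api_keys (kind, key_hash, not_before, not_after) "
        "VALUES (?, ?, ?, ?)",
        (kind, key_hash, not_before, not_after),
    )
    conn.commit()


def revoke_key(conn, key_hash: str) -> None:
    conn.execute("UPDATE api_keys SET revoked = 1 WHERE key_hash = ?", (key_hash,))
    conn.commit()


def key_is_active(conn, kind: str, key_hash: str, now: str) -> bool:
    """True if a non-revoked key of this kind is inside its validity window."""
    row = conn.execute(
        "SELECT 1 FROM api_keys "
        "WHERE kind = ? AND key_hash = ? AND revoked = 0 "
        "AND not_before <= ? AND ? < not_after",
        (kind, key_hash, now, now),
    ).fetchone()
    return row is not None
```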
Design the API so that new types of auth / keys can be added one day, and
the old ones deprecated.
Endpoints needed:
- (X) Query server list
- (X) Query directory in server
- (not needed) GET file with byte range (identical to frontend file API)
These will all be JSON for now since Python, Rust, C++, C#, etc. can handle it.
For compatibility with wget spidering, I _might_ do XML or HTML that's
machine-readable. We'll see.
## Decision journal
**Who generates the API key? The scraper client, or the PTTH relay server?**
The precedent from big cloud vendors seems to be that the server generates
tokens. This is probably to avoid a situation where clients with vulnerable
crypto code or just bad code generate low-entropy keys. By putting that
responsibility on the server, the server can enforce high-entropy keys.
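
As an illustration of server-side generation, a sketch using Python's `secrets` module. The relay would do the equivalent in its own language. The 32-byte length and the SHA-256 hash are assumptions for this sketch, and `generate_scraper_key` is a hypothetical name.

```python
import hashlib
import secrets


def generate_scraper_key(num_bytes: int = 32):
    """Generate a high-entropy key on the server.

    Returns (key, key_hash). The key is shown to the user once; only the
    hash would be kept by the relay. The hash choice is an assumption.
    """
    # token_urlsafe draws from the OS CSPRNG and returns URL-safe Base64 text.
    key = secrets.token_urlsafe(num_bytes)
    key_hash = hashlib.sha256(key.encode("ascii")).hexdigest()
    return key, key_hash
```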
**Should the key rotate? If so, how?**
The key should _at least_ expire. If it expires every 30 or 90 days, then a
human is only slightly inconvenienced by having to service their scraper
regularly.
When adding other features, we must consider the use cases:
1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server
For the Bash script, rotation will probably be difficult, and I'm okay if
our support for that is merely "It'll work for 30 days at a time, then you
need to rotate keys manually."
For the Python script, rotation could be automated, but cryptography is
still probably difficult. I think some AWS services require actual crypto
keys, and not just high-entropy password keys.
For the sophisticated desktop app, cryptography is on the table, but this
is the least likely use case to ever happen, too.