ptth/issues/2020-12Dec/auth-route-YNQAQKJS.md

# Auth route for scrapers

(Find this issue with `git grep YNQAQKJS`)

## Problem statement

PTTH has 2 auth routes:

- A fixed API key for servers
- Whatever the end user puts in front of the HTML client

"Whatever" is hard for scrapers to deal with. This barrier to scraping
is blocking these issues:

- EOTPXGR3 Remote `tail -f`
- UPAQ3ZPT Audit logging of the relay itself
- YMFMSV2R Add Prometheus metrics

## Proposal

Add a 3rd auth route meeting these criteria:

- Enabled by a feature flag, disabled by default
- Bootstrapped by the user-friendly HTML frontend
- Suitable for headless automated scrapers

It will probably involve an API key like the servers use. Public-key
crypto is stronger, but involves more work. I think we should plan to 
start with something weak, and also plan to deprecate it once something
stronger is ready.

## Proposed impl plan

- (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
- (X) Make sure Docker release CAN build
- (X) Add hash of 1 scraper key to ptth_relay.toml, with 1 week expiration
- (X) Accept scraper key for some testing endpoint
- (X) (POC) Test with curl
- ( ) Clean up scraper endpoint
- ( ) Manually create SQLite DB for scraper keys, add 1 hash
- ( ) Impl DB reads
- ( ) Remove scraper key from config file
- ( ) Make sure `cargo test` passes and Docker CAN build
- ( ) (MVP) Test with curl
- ( ) Impl and test DB init / migration
- ( ) Impl DB writes (Add / revoke keys) as CLI commands
- ( ) Implement API (Behind X-Email auth) for that, test with curl
- ( ) Set up mitmproxy or something to add X-Email header in dev env
- ( ) Implement web UI (Behind X-Email)

POC is the proof-of-concept - At this point we will know that in theory the
feature can work.

MVP is the first deployable version - I could put it in prod, manually fudge
the SQLite DB to add a 1-month key, and let people start building scrapers.

Details:

Dev mode will allow anonymous users to generate scraper keys. In prod mode
(the default), clients will need to have the X-Email header set or use a
scraper key to do anything.
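
As a minimal sketch of one reading of the rules above (not the actual relay code), the check could look like this. The `Flags` and `Credentials` types and their field names are hypothetical. It also assumes that key generation in prod mode sits behind X-Email, as the impl plan suggests.

```rust
/// Feature flags as they might be read from ptth_relay.toml.
/// The field names here are hypothetical.
#[derive(Clone, Copy, Debug)]
pub struct Flags {
    pub dev_mode: bool,
    pub scraper_auth: bool,
}

/// What a single request presented. Both fields are illustrative.
#[derive(Clone, Copy, Debug)]
pub struct Credentials {
    /// True if whatever sits in front of the relay set the X-Email header.
    pub has_x_email: bool,
    /// True if a scraper key was presented and passed validation.
    pub valid_scraper_key: bool,
}

/// Prod-mode rule: X-Email or a scraper key is needed to do anything.
/// A scraper key only counts while the scraper feature flag is enabled.
pub fn is_authenticated(flags: Flags, creds: Credentials) -> bool {
    creds.has_x_email || (flags.scraper_auth && creds.valid_scraper_key)
}

/// Key generation: anonymous users may do it in dev mode; otherwise it is
/// assumed to require X-Email. A scraper key cannot mint more keys here.
pub fn may_generate_scraper_key(flags: Flags, creds: Credentials) -> bool {
    flags.scraper_auth && (flags.dev_mode || creds.has_x_email)
}
```

Keeping the decision in pure functions like these would make it easy to cover with `cargo test` before any HTTP plumbing exists.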

Design the DB so that the servers can share it one day.

Design the API so that new types of auth / keys can be added one day, and
the old ones deprecated.
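
One way to leave room for new key types, sketched under assumptions: tag every key with its kind and let a policy decide whether that kind is accepted, deprecated, or rejected. The names `KeyKind`, `Status`, `Policy`, and the tag strings are all hypothetical; nothing here is decided yet.

```rust
/// Kinds of key the relay might understand. Both variants are placeholders.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum KeyKind {
    /// The weak starting point: a shared secret whose hash the relay stores.
    SharedSecretV1,
    /// Stand-in for a stronger scheme (for example public-key) added later.
    PublicKeyV2,
}

impl KeyKind {
    /// Parse a tag stored next to the key. The tag strings are made up.
    pub fn from_tag(tag: &str) -> Option<KeyKind> {
        match tag {
            "shared-secret-v1" => Some(KeyKind::SharedSecretV1),
            "public-key-v2" => Some(KeyKind::PublicKeyV2),
            _ => None,
        }
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Accepted,
    /// Still works, but clients should be warned to migrate.
    Deprecated,
    Rejected,
}

/// Per-kind policy, so an old kind can be phased out without code changes.
pub struct Policy {
    pub shared_secret_v1: Status,
    pub public_key_v2: Status,
}

impl Policy {
    pub fn status(&self, kind: KeyKind) -> Status {
        match kind {
            KeyKind::SharedSecretV1 => self.shared_secret_v1,
            KeyKind::PublicKeyV2 => self.public_key_v2,
        }
    }

    /// Deprecated kinds are still accepted; only rejected ones are refused.
    pub fn accepts(&self, kind: KeyKind) -> bool {
        self.status(kind) != Status::Rejected
    }
}
```

With something like this, deprecating the weak scheme becomes a two-step change: flip it to `Deprecated` while the stronger scheme rolls out, then to `Rejected`.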

## Open questions

**Who generates the API key? The scraper client, or the PTTH relay server?**

The precedent from big cloud vendors seems to be that the server generates
tokens. This is probably to avoid a situation where clients with vulnerable
crypto code or just bad code generate low-entropy keys. By putting that
responsibility on the server, the server can enforce high-entropy keys.
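
If the relay generates the keys, a rough sketch could be as small as the following. It assumes a Unix-like host where `/dev/urandom` is available, and the function names, the 32-byte length, and the hex encoding are illustrative choices, not decisions. A real implementation would more likely pull randomness from a vetted crate.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Number of random bytes per key: 32 bytes = 256 bits of entropy.
/// The length is an assumption, not something the plan specifies.
const KEY_BYTES: usize = 32;

/// Generate a scraper key on the server side, so that the server (not the
/// client) is responsible for the key's entropy.
/// Assumes a Unix-like OS that exposes /dev/urandom.
pub fn generate_scraper_key() -> io::Result<String> {
    let mut bytes = [0u8; KEY_BYTES];
    let mut urandom = File::open("/dev/urandom")?;
    urandom.read_exact(&mut bytes)?;
    Ok(to_hex(&bytes))
}

/// Lowercase hex encoding, two characters per byte.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

The relay would show the generated key to the user once and store only its hash, as the config-file step in the impl plan already does.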

**Should the key rotate? If so, how?**

The key should _at least_ expire. If it expires every 30 or 90 days, then a
human is slightly inconvenienced by having to service their scraper regularly.
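
Expiration itself is cheap to check. Here is a minimal sketch, assuming each stored key record carries a "not after" timestamp. The `ScraperKeyRecord` type, its fields, and the helper names are hypothetical.

```rust
use std::time::{Duration, SystemTime};

const SECS_PER_DAY: u64 = 24 * 60 * 60;

/// One stored scraper key. Only the hash is kept, never the key itself.
/// The type and its field names are hypothetical.
pub struct ScraperKeyRecord {
    pub hash: String,
    /// The key stops working at this instant.
    pub not_after: SystemTime,
}

/// Create a record that is valid for `lifetime_days` from `issued_at`.
pub fn issue(hash: String, issued_at: SystemTime, lifetime_days: u64) -> ScraperKeyRecord {
    ScraperKeyRecord {
        hash,
        not_after: issued_at + Duration::from_secs(lifetime_days * SECS_PER_DAY),
    }
}

/// A key is expired once `now` reaches its `not_after` time.
/// `now` is passed in explicitly so the check is easy to test.
pub fn is_expired(record: &ScraperKeyRecord, now: SystemTime) -> bool {
    now >= record.not_after
}
```

The same check would cover the 1-week key in the config file and a 1-month key in the SQLite DB, since only the lifetime passed to `issue` changes.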

When adding other features, we must consider the use cases:

1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if
our support for that is merely "It'll work for 30 days at a time, then you
need to rotate keys manually."

For the Python script, rotation could be automated, but cryptography is
still probably difficult. I think some AWS services require actual crypto
keys, and not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but this
is also the least likely use case to ever happen.