ptth/issues/2020-12Dec/auth-route-YNQAQKJS.md

# Auth route for scrapers

(Find this issue with `git grep YNQAQKJS`)

## Test curl commands

Export the scraper API's URL prefix to an environment variable:

`export API=http://127.0.0.1:4000/scraper`

Put your API key into a header file, like this:

```
X-ApiKey: bad_password
```

Call it "scraper-secret.txt" or something else obviously secret.
Don't check it into Git. The key expires every 30 days and, for now,
must be rotated manually.

Recent versions of curl (7.55.0 and later) can load headers from a text
file. All commands below use this feature to load the API key.

`curl --header @scraper-secret.txt $API/api/test`

Should return "You're valid!"

`curl --header @scraper-secret.txt $API/v1/server_list`

Should return a JSON object listing all the servers.
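
The same call can be made from a script instead of curl. This is a minimal sketch, assuming Python's standard `urllib`; the helper names `parse_header_file` and `build_request` are made up for illustration and are not part of PTTH. It only builds the request, so it runs without a relay.

```python
import urllib.request

def parse_header_file(text):
    """Parse 'Name: value' lines, as in scraper-secret.txt, into a dict."""
    headers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

def build_request(api_prefix, path, header_text):
    """Build (but do not send) a GET request carrying the API key header."""
    url = api_prefix.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers=parse_header_file(header_text))

# In real use, header_text would be read from scraper-secret.txt.
req = build_request(
    "http://127.0.0.1:4000/scraper",
    "v1/server_list",
    "X-ApiKey: bad_password\n",
)
print(req.full_url)
```

Sending it would be something like `json.load(urllib.request.urlopen(req))`, which needs a running relay.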

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/`

Proxies into the "aliens_wildland" server and retrieves a JSON object listing
the file server root. (The server must be running a new version of ptth_server
which can serve the JSON API)

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/src/`

Same, but retrieves the listing for "/src".

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

There is no special API for retrieving files yet, but the existing server
API is proxied through the new scraper API on the relay.

`curl --head --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports HEAD requests. This request will yield a "204 No Content", with
the "content-length" header.

`curl --header @scraper-secret.txt -H "range: bytes=100-199" $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports byte range requests. This request will skip 100 bytes into the
file, and read 100 bytes.

To avoid fence-post errors, most programming languages use half-open ranges.
e.g. `0..3` means "0, 1, 2". However, HTTP byte ranges are closed (inclusive)
ranges. e.g. `bytes=0-3` means "0, 1, 2, 3". So `bytes=100-199` means byte 199
is the last byte retrieved, for 100 bytes total.
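
To make the conversion concrete, here is a small sketch that turns a half-open range into the closed form HTTP expects. The function name `range_header` is hypothetical.

```python
def range_header(start, stop):
    """Convert a half-open byte range [start, stop) to an HTTP Range header value."""
    if start < 0 or stop <= start:
        raise ValueError("need 0 <= start < stop")
    # HTTP's last-byte position is inclusive, so subtract 1 from the half-open end.
    return f"bytes={start}-{stop - 1}"

print(range_header(100, 200))
```

Skipping 100 bytes and reading 100 bytes is `range_header(100, 200)`, which matches the `100-199` in the curl command above.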

By polling with HEAD and byte range requests, a scraper client can approximate
`tail -f` behavior of a server-side file.
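
One polling step could look roughly like this. It is a sketch, not what any PTTH client does: `poll_tail` and its two callables are hypothetical, and restarting from 0 when the file shrinks is my assumption. The callables would wrap a HEAD request (to read "content-length") and a byte range GET; here they are faked with an in-memory buffer so the example runs offline.

```python
def poll_tail(offset, head_length, get_range):
    """One polling step of an approximate `tail -f`.

    offset      -- number of bytes already read
    head_length -- callable returning the file's current length (from HEAD)
    get_range   -- callable(first, last) returning bytes first..last inclusive
    Returns (new_bytes, new_offset).
    """
    length = head_length()
    if length < offset:
        # File shrank (truncated or rotated): start over. Assumed policy.
        offset = 0
    if length == offset:
        return b"", offset
    data = get_range(offset, length - 1)
    return data, offset + len(data)

# Stand-in for the remote file so the sketch runs without a relay.
remote = bytearray(b"line 1\n")
head = lambda: len(remote)
get = lambda first, last: bytes(remote[first:last + 1])

chunk, pos = poll_tail(0, head, get)      # reads the existing content
remote += b"line 2\n"                     # the server appends to the file
chunk2, pos = poll_tail(pos, head, get)   # reads only the appended bytes
```

A real client would call this in a loop with a sleep between polls.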

`wget --continue --execute robots=off --no-parent --recursive --header "$(<scraper-secret.txt)" $API/v1/server/aliens_wildland/files/crates/`

Use wget's recursive spidering to download all the files in a folder.
The human-friendly HTML interface is exposed through the scraper
API, so this will also download the HTML directory listings.

- `--continue` uses the server's content-length header to skip over
files that are already fully downloaded to local disk. Partial
downloads will be resumed where they left off, which is fine
for long-running log files that may append new data but not
modify old data.
- `--execute robots=off` disables wget's handling of robots.txt.
We know we're a robot, the server doesn't care, it's fine.
- `--no-parent` prevents the `../` links from accidentally causing
infinite recursion.
- `--recursive` causes wget to recurse into individual files, and
into subdirectories.
- `--header "$(<scraper-secret.txt)"` tells Bash to load the
secret API key from disk and pass it to wget as an argument. The
secret will leak into the process list, but at least it won't leak
into your bash_history file.

## Problem statement

PTTH has 2 auth routes:

- A fixed API key for servers
- Whatever the end user puts in front of the HTML client

"Whatever" is hard for scrapers to deal with. This barrier to scraping
is blocking these issues:

- EOTPXGR3 Remote `tail -f`
- UPAQ3ZPT Audit logging of the relay itself
- YMFMSV2R Add Prometheus metrics

## Proposal

Add a 3rd auth route meeting these criteria:

- Enabled by a feature flag, disabled by default
- Bootstrapped by the user-friendly HTML frontend
- Suitable for headless automated scrapers

It will probably involve an API key like the servers use. Public-key
crypto is stronger, but involves more work. I think we should plan to 
start with something weak, and also plan to deprecate it once something
stronger is ready.

## Proposed impl plan

- (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
- (X) Make sure Docker release CAN build
- (X) Add hash of 1 scraper key to ptth_relay.toml, with 1 week expiration
- (X) Accept scraper key for some testing endpoint
- (X) (POC) Test with curl
- (X) Clean up scraper endpoint
- (X) Add (almost) end-to-end tests for test scraper endpoint
- (X) Thread server endpoints through relay scraper auth
- (don't care) Add tests for other scraper endpoints
- (don't care) Factor v1 API into v1 module
- (X) Add real scraper endpoints
- ( ) Manually create SQLite DB for scraper keys, add 1 hash
- ( ) Impl DB reads
- ( ) Remove scraper key from config file
- ( ) Make sure `cargo test` passes and Docker CAN build
- ( ) (MVP) Test with curl
- ( ) Impl and test DB init / migration
- ( ) Impl DB writes (Add / revoke keys) as CLI commands
- ( ) Implement API (Behind X-Email auth) for that, test with curl
- ( ) Set up mitmproxy or something to add X-Email header in dev env
- ( ) Implement web UI (Behind X-Email)

POC is the proof-of-concept - At this point we will know that in theory the
feature can work.

MVP is the first deployable version - I could put it in prod, manually fudge
the SQLite DB to add a 1-month key, and let people start building scrapers.

Details:

Dev mode will allow anonymous users to generate scraper keys. In prod mode
(the default), clients will need to have the X-Email header set or use a
scraper key to do anything.

Design the DB so that the servers can share it one day.

Design the API so that new types of auth / keys can be added one day, and
the old ones deprecated.

Endpoints needed:

- (X) Query server list
- (X) Query directory in server
- (not needed) GET file with byte range (identical to frontend file API)

These will all be JSON for now since Python, Rust, C++, C#, etc. can handle it.
For compatibility with wget spidering, I _might_ do XML or HTML that's
machine-readable. We'll see.

## DB / UI impl

Sprint 1:

- Look up keys by their hash
- not_before
- not_after
- name
- X-Email associated with key

Sprint 2:

- UI to generate / revoke keys

## SQL schema

Migration

```
create table scraper_keys (
	hash text primary key,        -- Using blake3 for this because it's not a password
	not_before integer not null,  -- Seconds since epoch 
	not_after integer not null,   -- Seconds since epoch
	name text not null,           -- Human-friendly nickname
	email text not null           -- Email address that created the key
);
```

Look up hash

```
select not_before, not_after, name, email
from scraper_keys 
where 
	hash = $1 and 
	strftime ('%s') >= not_before and 
	strftime ('%s') < not_after
;
```
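
To sanity-check the lookup, here is a sketch using Python's built-in `sqlite3` against an in-memory DB. The `look_up_key` helper and the sample row are made up. The current time is passed in as a parameter instead of using `strftime`, which is my choice so the window check can be tested with fixed timestamps.

```python
import sqlite3

SCHEMA = """
create table scraper_keys (
    hash text primary key,
    not_before integer not null,
    not_after integer not null,
    name text not null,
    email text not null
);
"""

def look_up_key(conn, key_hash, now):
    """Return (name, email) if the hash exists and `now` is inside its
    [not_before, not_after) window, else None."""
    return conn.execute(
        "select name, email from scraper_keys "
        "where hash = ? and ? >= not_before and ? < not_after",
        (key_hash, now, now),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "insert into scraper_keys values (?, ?, ?, ?, ?)",
    ("abc123", 1000, 2000, "demo key", "user@example.com"),
)

print(look_up_key(conn, "abc123", 1500))
```

Note that `not_after` is exclusive here, matching the `<` in the query above.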

Create key

```
-- Generate entropy in app code
insert into scraper_keys (
	hash,
	not_before,
	not_after,
	name,
	email
) values (
	$1, 
	strftime ('%s'),
	strftime ('%s') + 2592000,  -- 30 days in seconds
	$2,
	$3
);

-- Respond to client with plaintext key and then forget it.
-- If a network blip causes the key to evaporate, the client should revoke it.
```
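
"Generate entropy in app code" might look like the sketch below. Everything here is an assumption for illustration: the function names, the key length, and especially the hash. The schema says blake3, which is not in Python's standard library, so this substitutes BLAKE2b as a stand-in. A real implementation would use whatever hash the relay uses.

```python
import hashlib
import secrets
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # 2592000 seconds, as in the insert above

def hash_key(plaintext):
    # Stand-in only: BLAKE2b substituted for blake3 (not in the stdlib).
    return hashlib.blake2b(plaintext.encode("utf-8")).hexdigest()

def new_scraper_key(name, email, now=None):
    """Generate a key. Returns (plaintext, row), where row matches the
    column order of scraper_keys: hash, not_before, not_after, name, email."""
    now = int(time.time()) if now is None else now
    plaintext = secrets.token_urlsafe(32)  # 32 random bytes from the OS CSPRNG
    row = (hash_key(plaintext), now, now + THIRTY_DAYS, name, email)
    return plaintext, row

plaintext, row = new_scraper_key("demo key", "user@example.com", now=1_600_000_000)
```

Only `row` would be stored. `plaintext` goes back to the client once and is then dropped.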

Revoke key

```

```
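
The revoke statement is still blank above. One possible shape, purely as an assumption on my part, is to pull `not_after` back to the current time rather than deleting the row, so the name and email stay around for auditing. The `revoke_key` helper is hypothetical.

```python
import sqlite3

def revoke_key(conn, key_hash, now):
    """Expire a key immediately by moving not_after to `now`.
    Returns True if a still-valid key was revoked, False otherwise."""
    cur = conn.execute(
        "update scraper_keys set not_after = ? where hash = ? and not_after > ?",
        (now, key_hash, now),
    )
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table scraper_keys ("
    "hash text primary key, not_before integer not null, "
    "not_after integer not null, name text not null, email text not null)"
)
conn.execute(
    "insert into scraper_keys values "
    "('abc123', 1000, 2000, 'demo key', 'user@example.com')"
)
```

A plain `delete from scraper_keys where hash = $1` would also work if the audit trail doesn't matter.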

## Decision journal

**Who generates the API key? The scraper client, or the PTTH relay server?**

The precedent from big cloud vendors seems to be that the server generates
tokens. This is probably to avoid a situation where clients with vulnerable
crypto code or just bad code generate low-entropy keys. By putting that
responsibility on the server, the server can enforce high-entropy keys.

**Should the key rotate? If so, how?**

The key should _at least_ expire. If it expires every 30 or 90 days, then a
human is slightly inconvenienced to service their scraper regularly.

When adding other features, we must consider the use cases:

1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if
our support for that is merely "It'll work for 30 days at a time, then you
need to rotate keys manually."

For the Python script, rotation could be automated, but cryptography is
still probably difficult. I think some AWS services require actual crypto
keys, and not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but this
is the least likely use case to ever happen, too.