ptth/issues/2020-12Dec/auth-route-YNQAQKJS.md

Auth route for scrapers

(Find this issue with git grep YNQAQKJS)

Problem statement

PTTH has 2 auth routes:

  • A fixed API key for servers
  • Whatever the end user puts in front of the HTML client

"Whatever" is hard for scrapers to deal with. This barrier to scraping is blocking these issues:

  • EOTPXGR3 Remote tail -f
  • UPAQ3ZPT Audit logging of the relay itself
  • YMFMSV2R Add Prometheus metrics

Proposal

Add a 3rd auth route meeting these criteria:

  • Enabled by a feature flag, disabled by default
  • Bootstrapped by the user-friendly HTML frontend
  • Suitable for headless automated scrapers

It will probably involve an API key like the servers use. Public-key crypto is stronger, but involves more work. I think we should plan to start with something weak, and also plan to deprecate it once something stronger is ready.

Proposed impl plan

  • (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
  • (X) Make sure Docker release CAN build
  • (X) Add hash of 1 scraper key to ptth_relay.toml, with 1 week expiration
  • (X) Accept scraper key for some testing endpoint
  • (X) (POC) Test with curl
  • (X) Clean up scraper endpoint
  • (X) Add (almost) end-to-end tests for scraper endpoint
  • ( ) Add tests for scraper endpoints
  • ( ) Factor v1 API into v1 module
  • ( ) Add real scraper endpoints
  • ( ) Manually create SQLite DB for scraper keys, add 1 hash
  • ( ) Impl DB reads
  • ( ) Remove scraper key from config file
  • ( ) Make sure cargo test passes and Docker CAN build
  • ( ) (MVP) Test with curl
  • ( ) Impl and test DB init / migration
  • ( ) Impl DB writes (Add / revoke keys) as CLI commands
  • ( ) Implement API (Behind X-Email auth) for that, test with curl
  • ( ) Set up mitmproxy or something to add X-Email header in dev env
  • ( ) Implement web UI (Behind X-Email)

POC is the proof of concept. At this point we will know that the feature can work in theory.

MVP is the first deployable version. I could put it in prod, manually fudge the SQLite DB to add a 1-month key, and let people start building scrapers.
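
As a rough sketch of the MVP's DB steps (create the DB by hand, add 1 hash, read it back), here is what the table could look like. Everything named here is an assumption: the table name `scraper_keys`, its columns, the use of SHA-256, and Unix-second timestamps are not decided anywhere in this issue.

```python
import hashlib
import sqlite3

SECONDS_PER_DAY = 86_400


def hash_key(key: str) -> str:
    # Assumption: SHA-256 hex digest. The issue does not name a hash function.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: only the hash is stored, never the key itself.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraper_keys ("
        " hash TEXT PRIMARY KEY,"
        " not_after INTEGER NOT NULL)"  # expiration, Unix seconds
    )
    conn.commit()


def add_key(conn: sqlite3.Connection, key: str, now: int, valid_days: int) -> None:
    conn.execute(
        "INSERT INTO scraper_keys (hash, not_after) VALUES (?, ?)",
        (hash_key(key), now + valid_days * SECONDS_PER_DAY),
    )
    conn.commit()


def revoke_key(conn: sqlite3.Connection, key: str) -> None:
    conn.execute("DELETE FROM scraper_keys WHERE hash = ?", (hash_key(key),))
    conn.commit()


def key_is_valid(conn: sqlite3.Connection, key: str, now: int) -> bool:
    # Valid means: the hash is present and the expiration has not passed.
    row = conn.execute(
        "SELECT not_after FROM scraper_keys WHERE hash = ?",
        (hash_key(key),),
    ).fetchone()
    return row is not None and now < row[0]
```

Passing `now` in explicitly keeps the expiration logic easy to test. A 1-month key would be `add_key(conn, key, now, 30)` under these assumptions.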

Details:

Dev mode will allow anonymous users to generate scraper keys. In prod mode (the default), clients will need to have the X-Email header set or use a scraper key to do anything.
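
The two rules above can be sketched as plain predicates. This is only an illustration of the decision logic, not the relay's code: the function names are hypothetical, and `scraper_key_valid` stands in for whatever key check the relay ends up doing.

```python
from typing import Mapping


def has_email(headers: Mapping[str, str]) -> bool:
    # HTTP header names are case-insensitive, so compare in lowercase.
    # An empty X-Email value is treated as "not set" (an assumption).
    return any(
        name.lower() == "x-email" and bool(value.strip())
        for name, value in headers.items()
    )


def may_generate_scraper_key(dev_mode: bool, headers: Mapping[str, str]) -> bool:
    # Dev mode lets anonymous users generate keys.
    # Otherwise the request must carry X-Email.
    return dev_mode or has_email(headers)


def may_use_relay(headers: Mapping[str, str], scraper_key_valid: bool) -> bool:
    # Prod-mode rule: X-Email set, or a valid scraper key presented.
    return has_email(headers) or scraper_key_valid
```

Keeping these as pure functions means the dev / prod split can be unit-tested without starting the relay.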

Design the DB so that the servers can share it one day.

Design the API so that new types of auth / keys can be added one day, and the old ones deprecated.

Endpoints needed:

  • (X) Query server list
  • ( ) Query directory in server
  • ( ) GET file with byte range (identical to frontend file API)

These will all be JSON for now since Python, Rust, C++, C#, etc. can handle it. For compatibility with wget spidering, I might do XML or HTML that's machine-readable. We'll see.
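
For the byte-range endpoint, a scraper would send a standard HTTP `Range` header. A minimal client-side sketch, assuming a hypothetical key header name (`X-Scraper-Key`) and whatever URL the endpoint ends up at, neither of which is settled here:

```python
import urllib.request

# Hypothetical: the issue does not say which header carries the scraper key.
KEY_HEADER = "X-Scraper-Key"


def range_headers(api_key: str, first_byte: int, last_byte: int) -> dict:
    """Headers for fetching bytes first_byte..last_byte, both inclusive."""
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    return {
        KEY_HEADER: api_key,
        "Range": f"bytes={first_byte}-{last_byte}",
    }


def build_request(url: str, api_key: str, first_byte: int, last_byte: int):
    # Builds the request without sending it.
    # Pass the result to urllib.request.urlopen() to actually fetch.
    return urllib.request.Request(
        url, headers=range_headers(api_key, first_byte, last_byte)
    )
```

A server that honors the range replies with 206 Partial Content. One that ignores it replies with 200 and the whole file, so a scraper should check the status code.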

Open questions

Who generates the API key? The scraper client, or the PTTH relay server?

The precedent from big cloud vendors seems to be that the server generates tokens. This is probably to avoid a situation where clients with vulnerable crypto code or just bad code generate low-entropy keys. By putting that responsibility on the server, the server can enforce high-entropy keys.
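
If the relay generates the key, it could look something like the sketch below. It is written in Python for brevity, not in the relay's own language. The function names are hypothetical, and SHA-256 is an assumption, since the hash function is not chosen in this issue.

```python
import hashlib
import hmac
import secrets


def generate_scraper_key() -> tuple:
    # 32 random bytes from the OS CSPRNG, i.e. 256 bits of entropy.
    key = secrets.token_urlsafe(32)
    # The server keeps only the hash. The key is shown to the user once.
    digest = hashlib.sha256(key.encode("ascii")).hexdigest()
    return key, digest


def key_matches(presented_key: str, stored_digest: str) -> bool:
    presented = hashlib.sha256(presented_key.encode("utf-8")).hexdigest()
    # Constant-time comparison, to avoid leaking how many characters matched.
    return hmac.compare_digest(presented, stored_digest)
```

Because the server picks the key, a client with weak or buggy random number code never gets the chance to choose a guessable one.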

Should the key rotate? If so, how?

The key should at least expire. If it expires every 30 or 90 days, then a human is slightly inconvenienced by having to service their scraper regularly.

When adding other features, we must consider the use cases:

  1. A really dumb Bash script that shells out to curl
  2. A Python script
  3. A sophisticated desktop app in C#, Rust, or C++
  4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if our support for that is merely "It'll work for 30 days at a time, then you need to rotate keys manually."

For the Python script, rotation could be automated, but cryptography is still probably difficult. I think some AWS services require actual crypto keys, and not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but this is also the use case least likely to ever happen.