# Auth route for scrapers

(Find this issue with `git grep YNQAQKJS`)

## Problem statement

PTTH has 2 auth routes:

- A fixed API key for servers
- Whatever the end user puts in front of the HTML client

"Whatever" is hard for scrapers to deal with. This barrier to scraping is
blocking these issues:

- EOTPXGR3 Remote `tail -f`
- UPAQ3ZPT Audit logging of the relay itself
- YMFMSV2R Add Prometheus metrics

## Proposal

Add a 3rd auth route meeting these criteria:

- Enabled by a feature flag, disabled by default
- Bootstrapped by the user-friendly HTML frontend
- Suitable for headless automated scrapers

It will probably involve an API key like the servers use. Public-key crypto is
stronger, but involves more work. I think we should plan to start with
something weak, and also plan to deprecate it once something stronger is ready.

## Proposed impl plan

- (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
- (X) Make sure Docker release CAN build
- (X) Add hash of 1 scraper key to ptth_relay.toml, with 1-week expiration
- (X) Accept scraper key for some testing endpoint
- (X) (POC) Test with curl
- (X) Clean up scraper endpoint
- (X) Add (almost) end-to-end tests for scraper endpoint
- ( ) Add real scraper endpoints
- ( ) Manually create SQLite DB for scraper keys, add 1 hash
- ( ) Impl DB reads
- ( ) Remove scraper key from config file
- ( ) Make sure `cargo test` passes and Docker CAN build
- ( ) (MVP) Test with curl
- ( ) Impl and test DB init / migration
- ( ) Impl DB writes (add / revoke keys) as CLI commands
- ( ) Implement API (behind X-Email auth) for that, test with curl
- ( ) Set up mitmproxy or something to add the X-Email header in the dev env
- ( ) Implement web UI (behind X-Email)

POC is the proof of concept - at this point we will know that the feature can
work, at least in theory.

MVP is the first deployable version - I could put it in prod, manually fudge
the SQLite DB to add a 1-month key, and let people start building scrapers.

Details:

Dev mode will allow anonymous users to generate scraper keys. In prod mode
(the default), clients will need to have the X-Email header set or use a
scraper key to do anything.

Design the DB so that the servers can share it one day. Design the API so that
new types of auth / keys can be added one day, and the old ones deprecated.

Endpoints needed:

- ( ) Query server list
- ( ) Query directory in server
- ( ) GET file with byte range (identical to the frontend file API)

These will all be JSON for now, since Python, Rust, C++, C#, etc. can all
handle it. For compatibility with wget spidering, I _might_ do XML or HTML
that's machine-readable. We'll see.
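To make the endpoint list concrete, here is a rough sketch of how the
Python-script use case (discussed under Open questions below) might call them.
The relay URL, the `/scraper/api/...` paths, and the `X-ApiKey` header name
are placeholders made up for illustration; nothing here is settled beyond
"one key, JSON responses, and a byte-range GET that matches the frontend
file API."

```python
# Hypothetical scraper client for the proposed JSON endpoints.
# The base URL, the /scraper/api/... paths, and the X-ApiKey header name
# are assumptions for illustration -- the real routes are not decided yet.

import json
import os
import urllib.request

RELAY = os.environ.get("PTTH_RELAY", "http://127.0.0.1:4000")
API_KEY = os.environ["PTTH_SCRAPER_KEY"]  # rotated manually for now

def scraper_get(path, extra_headers=None):
    req = urllib.request.Request(RELAY + path)
    req.add_header("X-ApiKey", API_KEY)
    for name, value in (extra_headers or {}).items():
        req.add_header(name, value)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Query server list (JSON)
servers = json.loads(scraper_get("/scraper/api/server_list"))

# Query a directory inside one server (JSON)
listing = json.loads(scraper_get("/scraper/api/server/example_server/files/"))

# GET a file with a byte range, like the frontend file API
chunk = scraper_get(
    "/scraper/api/server/example_server/files/logs/app.log",
    {"Range": "bytes=0-1023"},
)
```

Reading the key from an environment variable keeps rotation a one-line change,
which also matters for the dumb-Bash-script use case below.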
## Open questions

**Who generates the API key? The scraper client, or the PTTH relay server?**

The precedent from big cloud vendors seems to be that the server generates
tokens. This is probably to avoid a situation where clients with vulnerable
crypto code, or just bad code, generate low-entropy keys. By putting that
responsibility on the server, the server can enforce high-entropy keys.

**Should the key rotate? If so, how?**

The key should _at least_ expire. If it expires every 30 or 90 days, then a
human is only slightly inconvenienced by having to service their scraper
regularly.

When adding other features, we must consider the use cases:

1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if our
support for that is merely "It'll work for 30 days at a time, then you need to
rotate keys manually."

For the Python script, rotation could be automated, but cryptography is still
probably difficult. I think some AWS services require actual crypto keys, and
not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but it is
also the least likely use case to ever happen.
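Tying the two questions together, below is a minimal sketch of the "server
generates the key, and the key expires" scheme, written as Python pseudocode
even though the relay itself is Rust. The `scraper_keys` table, the SHA-256
hash, and the 30-day window are all assumptions for illustration; the real
schema and hash choice are still open. It also doubles as the kind of manual
SQLite fiddling the MVP step describes (add one key hash by hand).

```python
# Sketch only: server-side issue/check of a scraper key, with expiry.
# The relay is Rust; this Python just pins down the scheme. Table name,
# columns, SHA-256, and the 30-day expiry are assumptions, not decided.

import hashlib
import secrets
import sqlite3
import time

THIRTY_DAYS = 30 * 24 * 60 * 60

def issue_scraper_key(db: sqlite3.Connection, name: str) -> str:
    # 32 random bytes from the OS CSPRNG -- the server enforces entropy,
    # so a buggy client can't hand us a weak key.
    key = secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    db.execute(
        "INSERT INTO scraper_keys (name, key_hash, expires_at) VALUES (?, ?, ?)",
        (name, key_hash, int(time.time()) + THIRTY_DAYS),
    )
    db.commit()
    # The plaintext key is shown to the human exactly once, then forgotten.
    return key

def check_scraper_key(db: sqlite3.Connection, presented: str) -> bool:
    key_hash = hashlib.sha256(presented.encode()).hexdigest()
    row = db.execute(
        "SELECT expires_at FROM scraper_keys WHERE key_hash = ?",
        (key_hash,),
    ).fetchone()
    return row is not None and row[0] > time.time()

if __name__ == "__main__":
    db = sqlite3.connect("scraper_keys.sqlite")
    db.execute(
        "CREATE TABLE IF NOT EXISTS scraper_keys"
        " (name TEXT, key_hash TEXT UNIQUE, expires_at INTEGER)"
    )
    new_key = issue_scraper_key(db, "my-first-scraper")
    print("Scraper key (shown once):", new_key)
    assert check_scraper_key(db, new_key)
```

Since the keys are high-entropy, a plain fast hash may be good enough, but
that is a design decision to settle later, not something this sketch decides.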