Add disk-backed SQLite cache mode as an alternative to in-memory

This commit is contained in:
Jan Doubravský
2026-06-08 11:39:04 +02:00
parent 757a8f4eba
commit 209ae667ab
10 changed files with 280 additions and 67 deletions
+19 -3
View File
@@ -240,7 +240,7 @@ Each value is a list of index definitions: a string is a single-column index, a
## Persistence
The in-memory cache is persisted to `cache.db` on disk:
By default the cache lives in an **in-memory SQLite** and is persisted to `cache.db` on disk:
- **On startup**: if `cache.db` exists, it is loaded into memory.
- **Periodically**: a background thread writes a snapshot to disk every `SQLMEM_BACKUP_INTERVAL` seconds.
@@ -248,6 +248,20 @@ The in-memory cache is persisted to `cache.db` on disk:
The schema version is checked on load — if it does not match, the stale file is discarded and the cache is rebuilt from the database.
### Disk-backed cache (no RAM copy)
Set `in_memory=False` (or `SQLMEM_IN_MEMORY=false`) to query the on-disk `cache.db` **directly** instead of mirroring it in RAM:
```python
engine = CachingEngine(base_engine, in_memory=False)
```
- The cache can **exceed available memory** — nothing is held in RAM beyond SQLite's page cache.
- Every write **persists immediately** (WAL + `synchronous=NORMAL`), so there is no hourly backup thread, no load-into-memory step on startup, and no shutdown flush to lose.
- On open, a cache file with a mismatched schema version is wiped in place and rebuilt; `engine.reset()` drops the cached tables and `VACUUM`s the file (it does not delete the open file).
The constructor argument wins over the env var; when `in_memory` is omitted it falls back to `SQLMEM_IN_MEMORY`.
## Manual cache control
```python
@@ -286,8 +300,9 @@ Each `TableStats` reports a live processing **state** and how the table is kept
## Memory and very large tables
The cache is **in-memory SQLite**, so a cached table lives in RAM — it must fit in available memory. To keep huge tables manageable:
By default the cache is **in-memory SQLite**, so a cached table lives in RAM — it must fit in available memory. To keep huge tables manageable:
- **Use [disk-backed mode](#disk-backed-cache-no-ram-copy)** (`in_memory=False`) when the working set simply doesn't fit in RAM — queries then run against `cache.db` on disk instead of a memory copy.
- **Loads are streamed in batches** (`SQLMEM_FETCH_BATCH` rows at a time, default 10 000) into a staging table and swapped in atomically. A multi-million-row table never gets fully materialized in Python at once, so the load doesn't spike memory or crash the process, and readers keep seeing the previous copy until the swap completes.
- Use **[delta refresh](#incremental-delta-refresh)** for large tables that have a change column — after the first load only changed rows are pulled, so restarts and refreshes don't re-read the whole table.
- A **single query that returns a huge result set** (e.g. `SELECT *` over a multi-million-row cached table) still materializes that result as a list of dicts; bound it with a `WHERE`/`LIMIT` rather than selecting everything.
@@ -300,7 +315,8 @@ Set via environment variables or a `.env` file:
|---|---|---|
| `SQLMEM_DEBUG` | `false` | `true` enables DEBUG-level logging |
| `SQLMEM_CACHE_DB` | `cache.db` | Path to the on-disk persistence file |
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds |
| `SQLMEM_IN_MEMORY` | `true` | `false` queries `cache.db` on disk directly (no RAM copy); overridden by the `in_memory` constructor arg |
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds (in-memory mode only) |
| `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
| `SQLMEM_REFRESH_INTERVAL` | `300` | background refresh tick (seconds) — delta pulls and proactive TTL reloads |
| `SQLMEM_FETCH_BATCH` | `10000` | rows fetched per batch when loading a table — caps peak memory for huge tables |