Batch large-table loads to bound memory and add per-table state to stats

Jan Doubravský
2026-06-05 14:44:07 +02:00
parent 85bb84a1a6
commit 286a5f207d
11 changed files with 436 additions and 29 deletions
+26 -1
@@ -245,9 +245,33 @@ Use `reset()` after a **structural change** in the source (columns added/removed
stats = engine.stats # Stats snapshot
print(stats.hits, stats.misses, stats.refetches)
for name, t in stats.tables.items():
-    print(name, t.rows, t.columns, t.last_refresh)
+    print(name, t.rows, t.state, t.tracking, t.last_refresh)
```

Each `TableStats` reports a live processing **state** and how the table is kept fresh (**tracking**):

| `state` | Meaning |
|---|---|
| `loading` | a full load is in progress |
| `refreshing` | an incremental (delta) refresh is in progress |
| `ready` | cached and idle (up to date) |
| `stale` | a TTL table whose cache has expired; reloads on next access |
| `error` | the last load failed |

| `tracking` | Meaning |
|---|---|
| `delta` | kept in sync incrementally via a change column |
| `ttl` | full-reloaded when older than its TTL |
| `static` | loaded on demand, never auto-refreshed |
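
As a rough illustration of how these fields might be used for monitoring, the sketch below groups tables by `state` and picks out the ones that probably need attention. It relies only on the `state` and `tracking` attributes listed above. The `summarize_states` helper and the stand-in objects are hypothetical and not part of the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_states(tables):
    """Group table names by their reported state.

    `tables` is any mapping of name -> object exposing `.state`,
    for example `engine.stats.tables`.
    """
    by_state = defaultdict(list)
    for name, t in tables.items():
        by_state[t.state].append(name)
    return {state: sorted(names) for state, names in by_state.items()}


# Stand-in objects for illustration only; real values would come from
# engine.stats.tables.
demo_tables = {
    "orders": SimpleNamespace(state="ready", tracking="delta"),
    "prices": SimpleNamespace(state="stale", tracking="ttl"),
    "regions": SimpleNamespace(state="error", tracking="static"),
}

summary = summarize_states(demo_tables)
# Tables whose last load failed, followed by expired TTL tables.
needs_attention = summary.get("error", []) + summary.get("stale", [])
print(needs_attention)
```

A `stale` table is not necessarily a problem, since it reloads on next access; whether to alert on it is a judgment call for the caller.
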

## Memory and very large tables

The cache is **in-memory SQLite**, so every cached table lives in RAM and must fit in available memory. To keep huge tables manageable:

- **Loads are streamed in batches** (`SQLMEM_FETCH_BATCH` rows at a time, default 10 000) into a staging table, which is then swapped in atomically. A multi-million-row table is never fully materialized in Python at once, so a load's peak memory in Python is bounded by the batch size rather than the table size, and readers keep seeing the previous copy until the swap completes.
- Use **[delta refresh](#incremental-delta-refresh)** for large tables that have a change column. After the first load only changed rows are pulled, so restarts and refreshes don't re-read the whole table.
- A **single query that returns a huge result set** (e.g. `SELECT *` over a multi-million-row cached table) is still materialized in full as a list of dicts; bound it with a `WHERE` clause or `LIMIT` rather than selecting everything.
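
The batch-and-swap idea from the first bullet can be sketched with the standard `sqlite3` module. This is a minimal illustration under stated assumptions, not the project's actual loader: the `load_in_batches` function, the `__staging` suffix, and the use of a second in-memory SQLite database as the "source" are all hypothetical.

```python
import sqlite3


def load_in_batches(source_cur, cache, table, columns, batch_size=10_000):
    """Stream rows from source_cur into cache[table] via a staging table.

    Hypothetical helper: at most `batch_size` rows are held in Python at once.
    """
    staging = f"{table}__staging"
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)

    cache.execute(f'DROP TABLE IF EXISTS "{staging}"')
    cache.execute(f'CREATE TABLE "{staging}" ({col_list})')

    total = 0
    while True:
        rows = source_cur.fetchmany(batch_size)  # one bounded batch
        if not rows:
            break
        cache.executemany(f'INSERT INTO "{staging}" VALUES ({placeholders})', rows)
        total += len(rows)
    cache.commit()  # close the implicit transaction opened by the inserts

    # Swap inside one explicit transaction so readers see either the old
    # table or the new one, never a missing table.
    try:
        cache.execute("BEGIN")
        cache.execute(f'DROP TABLE IF EXISTS "{table}"')
        cache.execute(f'ALTER TABLE "{staging}" RENAME TO "{table}"')
        cache.commit()
    except Exception:
        cache.rollback()
        raise
    return total


# Stand-in source database with 25 rows, loaded 10 rows at a time.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE big (id INTEGER, name TEXT)")
source.executemany("INSERT INTO big VALUES (?, ?)", ((i, f"row{i}") for i in range(25)))

cache = sqlite3.connect(":memory:")
cur = source.execute("SELECT id, name FROM big ORDER BY id")
loaded = load_in_batches(cur, cache, "big", ["id", "name"], batch_size=10)
count = cache.execute('SELECT COUNT(*) FROM "big"').fetchall()[0][0]
print(loaded, count)
```

Note that only the Python-side buffer is bounded here. The staging table and the previous copy both live in the cache until the swap, so the cache briefly holds two copies of the table.
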

## Configuration

Set via environment variables or a `.env` file:
@@ -259,6 +283,7 @@ Set via environment variables or a `.env` file:
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds |
| `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
| `SQLMEM_REFRESH_INTERVAL` | `300` | background refresh tick (seconds) — delta pulls and proactive TTL reloads |
| `SQLMEM_FETCH_BATCH` | `10000` | rows fetched per batch when loading a table — caps peak memory for huge tables |
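
For illustration, a setting such as `SQLMEM_FETCH_BATCH` could be read and validated along the following lines. How the library itself parses its configuration is not shown in this document, so the `fetch_batch_size` helper below is purely hypothetical; only the variable name and its default of `10000` come from the table above.

```python
import os


def fetch_batch_size(default=10_000):
    """Return the batch size from SQLMEM_FETCH_BATCH, or `default` if unset."""
    raw = os.environ.get("SQLMEM_FETCH_BATCH", "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("SQLMEM_FETCH_BATCH must be a positive integer")
    return value


# Example: lower the batch size for a memory-constrained host.
os.environ["SQLMEM_FETCH_BATCH"] = "2000"
print(fetch_batch_size())
```

A smaller batch lowers peak memory during a load at the cost of more round trips to the source; the right value depends on row width and available RAM.
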
## Exceptions