# Table extraction from scrolling video

Extract a table from a screen-recorded video: sample frames, run OCR (Portuguese) on each, align the recognized words to column/row bounds from an SVG template, then merge and deduplicate the per-frame tables into one CSV.

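A minimal sketch of the alignment step, not the code in `assign_cells.py`. It assumes each OCR word arrives with a pixel bounding box, that column and row bounds are `(start, end)` pixel intervals, and that a word belongs to the cell containing its center. `Word`, `find_band`, `words_to_grid`, and the `no_space_cols` parameter are hypothetical names; treating "col1" as the first column is a guess.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box in pixels (hypothetical structure)."""
    text: str
    x: float
    y: float
    w: float
    h: float


def find_band(value, bands):
    """Return the index of the (start, end) interval containing value, else None."""
    for index, (start, end) in enumerate(bands):
        if start <= value < end:
            return index
    return None


def words_to_grid(words, col_bounds, row_bounds, no_space_cols=(0,)):
    """Place each word in the cell holding its center, then join the cell text."""
    cells = {}
    for word in words:
        col = find_band(word.x + word.w / 2, col_bounds)
        row = find_band(word.y + word.h / 2, row_bounds)
        if col is None or row is None:
            continue  # word lies outside the annotated table area
        cells.setdefault((row, col), []).append((word.x, word.text))

    grid = [["" for _ in col_bounds] for _ in row_bounds]
    for (row, col), parts in cells.items():
        # Assumption: columns listed in no_space_cols are glued without spaces.
        separator = "" if col in no_space_cols else " "
        # Sort by x so the words in a cell keep left-to-right reading order.
        grid[row][col] = separator.join(text for _, text in sorted(parts))
    return grid
```

Using the word center rather than its left edge makes the assignment tolerant of boxes that slightly overhang a column boundary.
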
## Inputs

| File | Role |
|------|------|
| `video-data.webm` | Source video (scroll-down table). |
| `template.svg` | Annotation: rectangles under the image = column bounds; 28 rows. |

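A rough sketch of reading the annotation rectangles with the standard library, not the code in `svg_columns.py`. It assumes plain `<rect>` elements in the SVG namespace with unitless `x`, `y`, `width`, `height` attributes and no transforms. How the 28 rows are derived is not stated here, so `even_row_bounds` simply splits a vertical span evenly, which is a guess. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def rect_bounds(svg_text):
    """Return (x, y, width, height) for every <rect>, in document order."""
    root = ET.fromstring(svg_text)
    keys = ("x", "y", "width", "height")
    # Assumption: attribute values are unitless numbers (no "px", no "%").
    return [tuple(float(el.get(k, "0")) for k in keys)
            for el in root.iter(SVG_NS + "rect")]


def column_bounds(rects):
    """Turn rectangles into (left, right) intervals sorted left to right."""
    return sorted((x, x + w) for x, _y, w, _h in rects)


def even_row_bounds(top, bottom, n_rows=28):
    """Guess: split the span between top and bottom into equal-height rows."""
    step = (bottom - top) / n_rows
    return [(top + i * step, top + (i + 1) * step) for i in range(n_rows)]
```
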
## Outputs

| File / dir | Role |
|------------|------|
| `result.csv` | Final table: 6 columns (Concessionária, Código, Rodovia/UF, km inicial, km final, Extensão), one row per segment, deduplicated. |
| `frames/` | Working: one PNG and one CSV per sampled frame (e.g. 0, 10, 20, …). |
| `llm-fixed-frames/` | Optional: copy of frame CSVs after manual/LLM fixes; sew from here instead of `frames/` if used. |

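Because frames are identified by index (0, 10, 20, …), anything that merges them "in order" has to sort numerically; a plain string sort would put 100 before 20. A small sketch, assuming the hypothetical naming scheme `<frame index>.csv` (the actual file names are not given here):

```python
from pathlib import Path


def frame_csvs_in_order(directory):
    """Return frame CSV paths sorted by numeric frame index."""
    # Assumption: files are named like 0.csv, 10.csv, 20.csv; others are skipped.
    paths = [p for p in Path(directory).glob("*.csv") if p.stem.isdigit()]
    return sorted(paths, key=lambda p: int(p.stem))
```
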
## Run (same chore again)

**Fully automated (extract → fix → sew → result):**

```bash
nix-shell -p python3 python3Packages.opencv4 python3Packages.numpy python3Packages.pytesseract tesseract ffmpeg --run "./run.sh"
```

Or with custom video / output dir:

```bash
./run.sh path/to/video.webm my_frames
# result.csv is still written at project root
```

**If you do manual fixes:** copy `frames/*.csv` into `llm-fixed-frames/`, edit CSVs, then:

```bash
python3 sew_csvs.py llm-fixed-frames result.csv
```

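For orientation, the sewing idea can be sketched as below. This is an illustration under assumptions, not the contents of `sew_csvs.py`: rows are compared with exact equality (the project has its own `row_eq.py`), each frame is already a list of rows in scroll order, and `overlap_length` and `sew` are hypothetical names.

```python
def overlap_length(merged, nxt):
    """Length of the longest suffix of merged that equals a prefix of nxt."""
    for size in range(min(len(merged), len(nxt)), 0, -1):
        if merged[-size:] == nxt[:size]:
            return size
    return 0


def sew(frames):
    """Concatenate frame tables in order, dropping rows repeated at boundaries."""
    merged = []
    for rows in frames:
        merged.extend(rows[overlap_length(merged, rows):])

    # Drop any remaining exact duplicate rows, keeping first-seen order.
    seen = set()
    result = []
    for row in merged:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

Removing the boundary overlap first keeps the row order stable; the final pass only catches duplicates that the overlap check missed.
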
## Scripts (keep)

| Script | Role |
|--------|------|
| `extract_frames_and_tables.py` | Sample video every N frames → PNGs; OCR (por) + SVG column/row bounds → one CSV per frame. |
| `fix_all_csvs.py` | Heuristic fixes on frame CSVs (strip, E→-, pipe→space, extensão from km). |
| `sew_csvs.py` | Merge frame CSVs in order, remove boundary overlap, deduplicate full rows, write result. |
| `svg_columns.py` | Parse column/row rectangles from template.svg. |
| `assign_cells.py` | Map word boxes to (col, row); merge cell text (col1 no space, others space). |
| `clean_csv_heuristics.py` | Per-row cleanup and extensão correction (used by fix_all_csvs). |
| `row_eq.py` | Row equality for sewing. |

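The per-row cleanup might look roughly like the sketch below. It is not the code in `clean_csv_heuristics.py`, and it assumes the six-column order from `result.csv`, decimal commas with no thousands separator, and that extensão is km final minus km inicial. The E→- substitution is left out because its exact context is not described here. `parse_km`, `format_km`, and `clean_row` are hypothetical names.

```python
from decimal import Decimal, InvalidOperation

KM_START, KM_END, EXTENSION = 3, 4, 5  # column positions in the 6-column row


def parse_km(text):
    """Parse a decimal-comma number such as '12,5'; return None if unparseable."""
    try:
        return Decimal(text.strip().replace(",", "."))
    except InvalidOperation:
        return None


def format_km(value):
    """Format a Decimal back with a decimal comma."""
    return format(value, "f").replace(".", ",")


def clean_row(row):
    """Strip cells, turn stray pipes into spaces, recompute extensão from km."""
    cells = [" ".join(cell.replace("|", " ").split()) for cell in row]
    start, end = parse_km(cells[KM_START]), parse_km(cells[KM_END])
    if start is not None and end is not None:
        # Assumption: extensão = km final - km inicial.
        cells[EXTENSION] = format_km(end - start)
    return cells
```

`Decimal` is used instead of `float` so that values like 25,3 − 12,5 come out as 12,8 without binary rounding noise.
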
## Not kept (removed)

- `extract_table_frames.py` – superseded by `extract_frames_and_tables.py`.
- `extract_every_n_frames.py` – logic folded into `extract_frames_and_tables.py`.
- `sewn.csv` – superseded by `result.csv` (sew now writes result.csv and dedupes).

## Clean re-run

To start from scratch:

- Delete or clear `frames/` and optionally `llm-fixed-frames/`.
- Run `./run.sh` (or the manual-fix flow above).

`result.csv` is overwritten each run.