csv-extractor/README.md
2026-02-04 21:11:16 -03:00

# Table extraction from scrolling video
Extract a table from a screen-recorded video: sample frames, OCR (Portuguese), align to column/row bounds from an SVG template, then merge and deduplicate into one CSV.
## Inputs
| File | Role |
|------|------|
| `video-data.webm` | Source video (scroll-down table). |
| `template.svg` | Annotation template: the rectangles under the image mark the column bounds; the template covers 28 rows. |
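A minimal sketch of how column bounds could be read from an SVG template and used to place an OCR word in a column. It is not the code in `svg_columns.py` or `assign_cells.py`: the names `parse_rects`, `column_of` and `SAMPLE_SVG` are hypothetical, the inline SVG is a made-up two-column example, and it assumes plain `<rect>` elements with no transforms.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical miniature template with two column rectangles
# (NOT the real template.svg).
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect x="0" y="0" width="80" height="100"/>
  <rect x="80" y="0" width="120" height="100"/>
</svg>"""


def parse_rects(svg_text):
    """Return (x, y, width, height) for every <rect>, sorted left to right."""
    root = ET.fromstring(svg_text)
    rects = []
    for el in root.iter(SVG_NS + "rect"):
        # Missing attributes default to 0; transforms are ignored (assumption).
        rects.append(tuple(float(el.get(k, "0")) for k in ("x", "y", "width", "height")))
    return sorted(rects)


def column_of(x_center, rects):
    """Index of the rectangle whose horizontal span contains x_center, else None."""
    for i, (x, _y, w, _h) in enumerate(rects):
        if x <= x_center < x + w:
            return i
    return None


columns = parse_rects(SAMPLE_SVG)
print(columns)
print(column_of(100.0, columns))
```

The same containment test on the vertical axis would give the row index, so each word box maps to a (col, row) pair.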
## Outputs
| File / dir | Role |
|------------|------|
| `result.csv` | Final table: 6 columns (Concessionária, Código, Rodovia/UF, km inicial, km final, Extensão), one row per segment, deduplicated. |
| `frames/` | Working files: one PNG and one CSV per sampled frame (e.g. frames 0, 10, 20, …). |
| `llm-fixed-frames/` | Optional: copies of the frame CSVs after manual or LLM fixes; if used, sew from here instead of `frames/`. |
## Run (same chore again)
**Fully automated (extract → fix → sew → result):**
```bash
nix-shell -p python3 python3Packages.opencv4 python3Packages.numpy python3Packages.pytesseract tesseract ffmpeg --run "./run.sh"
```
Or with custom video / output dir:
```bash
./run.sh path/to/video.webm my_frames
# result.csv is still written at project root
```
**If you do manual fixes:** copy `frames/*.csv` into `llm-fixed-frames/`, edit the CSVs, then run:
```bash
python3 sew_csvs.py llm-fixed-frames result.csv
```
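For orientation, the sewing step (merge frame CSVs in order, drop the rows that overlap at frame boundaries, then deduplicate full rows) could look roughly like the sketch below. This is an assumption about the approach, not the contents of `sew_csvs.py`: the function name `sew` and the sample rows are hypothetical, and rows are compared with exact equality, whereas `row_eq.py` may use a looser comparison.

```python
def sew(frames):
    """Merge per-frame row lists in order, dropping overlap and repeated rows."""
    merged = []
    for rows in frames:
        # Find the longest suffix of `merged` that equals a prefix of `rows`.
        overlap = 0
        for k in range(min(len(merged), len(rows)), 0, -1):
            if merged[-k:] == rows[:k]:
                overlap = k
                break
        merged.extend(rows[overlap:])

    # Deduplicate full rows, keeping the first occurrence of each.
    seen = set()
    result = []
    for row in merged:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical frames from a scrolling table: consecutive frames share rows.
frame_a = [["A", "1"], ["B", "2"], ["C", "3"]]
frame_b = [["B", "2"], ["C", "3"], ["D", "4"]]
frame_c = [["D", "4"], ["E", "5"]]

sewn = sew([frame_a, frame_b, frame_c])
print(sewn)
```

Removing the boundary overlap first keeps row order stable; the final pass only catches rows that reappear in non-adjacent frames.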
## Scripts (keep)
| Script | Role |
|--------|------|
| `extract_frames_and_tables.py` | Sample the video every N frames → PNGs; OCR (Tesseract language `por`) + SVG column/row bounds → one CSV per frame. |
| `fix_all_csvs.py` | Heuristic fixes on frame CSVs (strip whitespace, E→-, pipe→space, recompute Extensão from the km columns). |
| `sew_csvs.py` | Merge frame CSVs in order, remove boundary overlap, deduplicate full rows, write result. |
| `svg_columns.py` | Parse column/row rectangles from template.svg. |
| `assign_cells.py` | Map word boxes to (col, row); join the words in each cell (column 1 without spaces, the other columns with a single space). |
| `clean_csv_heuristics.py` | Per-row cleanup and extensão correction (used by fix_all_csvs). |
| `row_eq.py` | Row equality for sewing. |
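As an illustration of the per-row heuristics, here is a hedged sketch that covers only three of them: stripping whitespace, replacing pipes with spaces, and recomputing Extensão. It is not the code in `clean_csv_heuristics.py`. The names `parse_km` and `clean_row` are hypothetical, the sample row is invented, and two things are assumed rather than taken from the repository: that km values use a comma as the decimal separator, and that Extensão equals the absolute difference between km final and km inicial.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_km(text):
    """Parse a comma-decimal value such as '12,5'; return None if unparseable."""
    try:
        return Decimal(text.strip().replace(",", "."))
    except InvalidOperation:
        return None


def clean_row(row):
    """Clean a 6-column row; column order follows result.csv."""
    # Pipes become spaces, runs of whitespace collapse, ends are stripped.
    cells = [re.sub(r"\s+", " ", c.replace("|", " ")).strip() for c in row]
    start, end = parse_km(cells[3]), parse_km(cells[4])
    if start is not None and end is not None:
        # Assumption: Extensão = |km final - km inicial|.
        cells[5] = str(abs(end - start)).replace(".", ",")
    return cells


# Hypothetical OCR output with a stray pipe and a misread Extensão.
raw = [" ACME | Sul ", "X1", "BR-101/SC", "12,5", "45,0", "3Z,5"]
fixed = clean_row(raw)
print(fixed)
```

When either km value cannot be parsed, the sketch leaves Extensão untouched so the row can still be fixed by hand.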
## Not kept (removed)
- `extract_table_frames.py`: superseded by `extract_frames_and_tables.py`.
- `extract_every_n_frames.py`: logic folded into `extract_frames_and_tables.py`.
- `sewn.csv`: superseded by `result.csv` (`sew_csvs.py` now writes `result.csv` directly and deduplicates).
## Clean re-run
To start from scratch (`result.csv` is overwritten on each run):
- Delete or clear `frames/` and, optionally, `llm-fixed-frames/`.
- Run `./run.sh` (or the manual-fix flow above).