Table extraction from scrolling video
Extract a table from a screen-recorded video: sample frames, OCR (Portuguese), align to column/row bounds from an SVG template, then merge and deduplicate into one CSV.
Inputs
| File | Role |
|---|---|
| video-data.webm | Source video (scroll-down table). |
| template.svg | Annotation: rectangles under the image = column bounds; 28 rows. |
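As a rough illustration of how rectangles in template.svg could become column bounds, here is a minimal sketch. It is not the code in svg_columns.py: the function names `column_bounds` and `column_index` are hypothetical, and it assumes each `<rect>` carries plain numeric `x` and `width` attributes with no units and no transforms. The example SVG is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def column_bounds(svg_text):
    """Return sorted (x_min, x_max) pairs, one per <rect> in the SVG.

    Assumption: x and width are bare numbers (no "px", no transforms).
    """
    root = ET.fromstring(svg_text)
    bounds = []
    for el in root.iter():
        # Match <rect> with or without the SVG namespace prefix.
        if el.tag in ("rect", SVG_NS + "rect"):
            x = float(el.get("x", "0"))
            width = float(el.get("width", "0"))
            bounds.append((x, x + width))
    return sorted(bounds)

def column_index(x_center, bounds):
    """Return the index of the column containing x_center, or None."""
    for i, (lo, hi) in enumerate(bounds):
        if lo <= x_center < hi:
            return i
    return None

# Made-up template with three column rectangles, listed out of order.
EXAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="300" height="50">
  <rect x="100" y="40" width="80" height="10"/>
  <rect x="0" y="40" width="100" height="10"/>
  <rect x="180" y="40" width="120" height="10"/>
</svg>"""

bounds = column_bounds(EXAMPLE_SVG)
print(bounds)
print(column_index(150.0, bounds))
```

Sorting the bounds by their left edge means the column index does not depend on the order in which the rectangles were drawn in the editor.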
Outputs
| File / dir | Role |
|---|---|
| result.csv | Final table: 6 columns (Concessionária, Código, Rodovia/UF, km inicial, km final, Extensão), one row per segment, deduplicated. |
| frames/ | Working: one PNG and one CSV per sampled frame (e.g. 0, 10, 20, …). |
| llm-fixed-frames/ | Optional: copy of frame CSVs after manual/LLM fixes; sew from here instead of frames/ if used. |
Run (same chore again)
Fully automated (extract → fix → sew → result):
nix-shell -p python3 python3Packages.opencv4 python3Packages.numpy python3Packages.pytesseract tesseract ffmpeg --run "./run.sh"
Or with custom video / output dir:
./run.sh path/to/video.webm my_frames
# result.csv is still written at project root
If you do manual fixes: copy frames/*.csv into llm-fixed-frames/, edit CSVs, then:
python3 sew_csvs.py llm-fixed-frames result.csv
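The sew step merges frame CSVs in order, removes the overlap at frame boundaries, and deduplicates full rows. The sketch below shows one way that logic could look. It is an assumption, not the actual sew_csvs.py: the function name `sew` is hypothetical, rows are compared with exact equality (the real row_eq.py may be more tolerant of OCR noise), and the sample rows are invented.

```python
def sew(frames):
    """Merge per-frame row lists in order, dropping overlapping and duplicate rows.

    frames: list of frames, each a list of rows, each row a list of strings.
    """
    merged = []
    for rows in frames:
        # Find the longest k such that the last k merged rows
        # equal the first k rows of the new frame.
        overlap = 0
        for k in range(min(len(merged), len(rows)), 0, -1):
            if merged[-k:] == rows[:k]:
                overlap = k
                break
        merged.extend(rows[overlap:])

    # Deduplicate full rows while preserving first-seen order.
    seen = set()
    result = []
    for row in merged:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# Invented frames: each one repeats the tail of the previous frame,
# and the last frame re-reads a row seen much earlier.
frames = [
    [["A", "1"], ["B", "2"], ["C", "3"]],
    [["B", "2"], ["C", "3"], ["D", "4"]],
    [["D", "4"], ["A", "1"], ["E", "5"]],
]
sewn = sew(frames)
print(sewn)
```

Removing the boundary overlap first keeps rows in scroll order; the final pass then catches repeats that are not adjacent to a frame boundary.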
Scripts (keep)
| Script | Role |
|---|---|
| extract_frames_and_tables.py | Sample video every N frames → PNGs; OCR (por) + SVG column/row bounds → one CSV per frame. |
| fix_all_csvs.py | Heuristic fixes on frame CSVs (strip, E→-, pipe→space, extensão from km). |
| sew_csvs.py | Merge frame CSVs in order, remove boundary overlap, deduplicate full rows, write result. |
| svg_columns.py | Parse column/row rectangles from template.svg. |
| assign_cells.py | Map word boxes to (col, row); merge cell text (col1 no space, others space). |
| clean_csv_heuristics.py | Per-row cleanup and extensão correction (used by fix_all_csvs). |
| row_eq.py | Row equality for sewing. |
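To make the cell-assignment step more concrete, the following sketch maps OCR word boxes to (col, row) by box center and then merges the words in each cell. It is illustrative only and not the contents of assign_cells.py: `find_band` and `assign_cells` are hypothetical names, the `(text, left, top, width, height)` tuple layout is an assumption about how word boxes are passed around, and since the table above does not say which index "col1" refers to, the columns joined without a space are passed in explicitly as `no_space_cols`. All sample values are made up.

```python
from collections import defaultdict

def find_band(center, bounds):
    """Return the index of the (lo, hi) band containing center, or None."""
    for i, (lo, hi) in enumerate(bounds):
        if lo <= center < hi:
            return i
    return None

def assign_cells(words, col_bounds, row_bounds, no_space_cols):
    """Group word boxes into cells and merge their text.

    words: iterable of (text, left, top, width, height) tuples.
    Returns a dict mapping (col, row) to the merged cell text.
    """
    cells = defaultdict(list)
    for text, x, y, w, h in words:
        col = find_band(x + w / 2, col_bounds)
        row = find_band(y + h / 2, row_bounds)
        if col is None or row is None:
            continue  # word falls outside the annotated grid
        cells[(col, row)].append((x, text))

    merged = {}
    for (col, row), items in cells.items():
        items.sort(key=lambda item: item[0])  # left-to-right within the cell
        sep = "" if col in no_space_cols else " "
        merged[(col, row)] = sep.join(text for _, text in items)
    return merged

# Made-up grid: two columns, two rows.
col_bounds = [(0, 100), (100, 200)]
row_bounds = [(0, 20), (20, 40)]
words = [
    ("12", 150, 2, 30, 12),     # column 1, row 0 (right-hand word)
    ("AB", 110, 2, 30, 12),     # column 1, row 0 (left-hand word)
    ("foo", 5, 22, 20, 12),     # column 0, row 1
    ("bar", 30, 22, 20, 12),    # column 0, row 1
    ("stray", 500, 2, 10, 10),  # outside every column, ignored
]
cells = assign_cells(words, col_bounds, row_bounds, no_space_cols={1})
print(cells)
```

Using the box center rather than its left edge makes the assignment less sensitive to words that slightly overhang a column boundary.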
Not kept (removed)
- extract_table_frames.py – superseded by extract_frames_and_tables.py.
- extract_every_n_frames.py – logic folded into extract_frames_and_tables.py.
- sewn.csv – superseded by result.csv (sew now writes result.csv and dedupes).
Clean re-run
To start from scratch:
- Delete or clear frames/ and optionally llm-fixed-frames/.
- Run ./run.sh (or the manual-fix flow above).
result.csv is overwritten each run.