csv-extractor/README.md
2026-02-04 21:11:16 -03:00

Table extraction from scrolling video

Extract a table from a screen-recorded video: sample frames, OCR (Portuguese), align to column/row bounds from an SVG template, then merge and deduplicate into one CSV.

Inputs

| File | Role |
| --- | --- |
| video-data.webm | Source video (scroll-down table). |
| template.svg | Annotation: rectangles under the image = column bounds; 28 rows. |
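
As an illustration of how rectangles in template.svg could become column bounds, here is a minimal sketch. It is not the code in svg_columns.py. It assumes every `<rect>` carries plain numeric `x`, `y`, `width` and `height` attributes (no units, no transforms); the function names `parse_rects`, `column_bounds` and `column_index` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard SVG namespace; ElementTree exposes it as a prefix on tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"


def parse_rects(svg_text):
    """Return (x, y, width, height) for every <rect>, in document order.

    Assumption: attributes are plain numbers such as "120.5", not "120.5px".
    """
    root = ET.fromstring(svg_text)
    rects = []
    for el in root.iter(SVG_NS + "rect"):
        rects.append(
            tuple(float(el.get(k, "0")) for k in ("x", "y", "width", "height"))
        )
    return rects


def column_bounds(rects):
    """Turn rectangles into (left, right) x-intervals sorted left to right."""
    return sorted((x, x + w) for x, _y, w, _h in rects)


def column_index(bounds, x_center):
    """Index of the column whose interval contains x_center, or None."""
    for i, (left, right) in enumerate(bounds):
        if left <= x_center < right:
            return i
    return None
```

An OCR word box would then be assigned to a column by passing the x-coordinate of its center to `column_index`; rows can be handled the same way on the y-axis.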

Outputs

| File / dir | Role |
| --- | --- |
| result.csv | Final table: 6 columns (Concessionária, Código, Rodovia/UF, km inicial, km final, Extensão), one row per segment, deduplicated. |
| frames/ | Working: one PNG and one CSV per sampled frame (e.g. 0, 10, 20, …). |
| llm-fixed-frames/ | Optional: copy of frame CSVs after manual/LLM fixes; sew from here instead of frames/ if used. |
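
Because frames are numbered 0, 10, 20, … a plain alphabetical sort would put 100 before 20, which matters when frame CSVs are merged in order. The sketch below shows one way to sort by frame number. It assumes each file name ends in the frame number before the extension (for example `0.csv` or `frame_0010.csv`); the actual naming used by the scripts may differ, and `frame_number` and `ordered_frame_csvs` are hypothetical helpers.

```python
import re
from pathlib import Path


def frame_number(path):
    """Extract the trailing integer from a file stem, e.g. 'frames/20.csv' -> 20.

    Assumption: the stem ends with the frame number.
    """
    match = re.search(r"(\d+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no frame number in file name: {path}")
    return int(match.group(1))


def ordered_frame_csvs(paths):
    """Sort frame CSV paths by frame number rather than alphabetically."""
    return sorted(paths, key=frame_number)
```

For a real directory this could be called as `ordered_frame_csvs(Path("frames").glob("*.csv"))`.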

Run (same chore again)

Fully automated (extract → fix → sew → result):

nix-shell -p python3 python3Packages.opencv4 python3Packages.numpy python3Packages.pytesseract tesseract ffmpeg --run "./run.sh"

Or with custom video / output dir:

./run.sh path/to/video.webm my_frames
# result.csv is still written at project root

If you do manual fixes: copy frames/*.csv into llm-fixed-frames/, edit CSVs, then:

python3 sew_csvs.py llm-fixed-frames result.csv

Scripts (keep)

| Script | Role |
| --- | --- |
| extract_frames_and_tables.py | Sample video every N frames → PNGs; OCR (por) + SVG column/row bounds → one CSV per frame. |
| fix_all_csvs.py | Heuristic fixes on frame CSVs (strip, E→-, pipe→space, extensão from km). |
| sew_csvs.py | Merge frame CSVs in order, remove boundary overlap, deduplicate full rows, write result. |
| svg_columns.py | Parse column/row rectangles from template.svg. |
| assign_cells.py | Map word boxes to (col, row); merge cell text (col1 no space, others space). |
| clean_csv_heuristics.py | Per-row cleanup and extensão correction (used by fix_all_csvs). |
| row_eq.py | Row equality for sewing. |
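
The sewing step (merge in order, remove boundary overlap, deduplicate full rows) can be sketched roughly as follows. This is not the code in sew_csvs.py. It assumes each frame is a list of rows, each row a list of strings, and it compares rows with exact equality, whereas the project has its own comparison in row_eq.py. The names `overlap_length` and `sew` are hypothetical.

```python
def overlap_length(prev_rows, next_rows):
    """Largest k such that the last k rows of prev_rows equal the first k of next_rows."""
    max_k = min(len(prev_rows), len(next_rows))
    for k in range(max_k, 0, -1):
        if prev_rows[-k:] == next_rows[:k]:
            return k
    return 0


def sew(frames):
    """Merge per-frame row lists into one table.

    1. Append frames in order, skipping rows that repeat the tail of what
       has been merged so far (the scroll overlap between frames).
    2. Drop any remaining exact duplicate rows, keeping the first occurrence.
    """
    merged = []
    for rows in frames:
        k = overlap_length(merged, rows)
        merged.extend(rows[k:])

    seen = set()
    result = []
    for row in merged:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

With exact equality, a single OCR difference in an overlapping row defeats the overlap detection, which is presumably why the heuristic fixes run before sewing and why row equality lives in its own module.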

Not kept (removed)

  • extract_table_frames.py: superseded by extract_frames_and_tables.py.
  • extract_every_n_frames.py: logic folded into extract_frames_and_tables.py.
  • sewn.csv: superseded by result.csv (sew_csvs.py now writes result.csv and deduplicates).

Clean re-run

To start from scratch:

  • Delete or clear frames/ and optionally llm-fixed-frames/.
  • Run ./run.sh (or the manual-fix flow above).

result.csv is overwritten each run.