cd /root/yasser/assignment1 && \PYTHONPATH=/root/y

Blog Post

Conversation title: cd /root/yasser/assignment1 && \PYTHONPATH=/root/yasser/assignment1 py...

Date: 01.05.2026

Category: 💻 Programming and Software Development

Total messages: 4 | Yasser: 3 | M: 1

Yasser

cd /root/yasser/assignment1 && \
PYTHONPATH=/root/yasser/assignment1 python3 run.py \
  --project-id TAKEOFF-56 \
  --input-dir "/tmp/new_project_extracted/new project/01_Sample_Projects_With_Expected_Output/01_Sample_Projects_With_Expected_Output/TAKEOFF-56 - JACK & JONES STATEN ISLAND, NY/Project Files" \
  --evaluate \
  --expected-dir "/tmp/new_project_extracted/new project/01_Sample_Projects_With_Expected_Output/01_Sample_Projects_With_Expected_Output/TAKEOFF-56 - JACK & JONES STATEN ISLAND, NY/Expected Manual Output" \
  2>&1 | tee /tmp/takeoff56_run.log

One thing I should tell you: I'm trying to get it to produce output right now, and I want to send him this:

I focused the build on constructing a robust, multi-stage extraction and evaluation pipeline rather than simply running a generic LLM over PDFs. For ingestion, I used multiple PDF libraries selected by content type: PyMuPDF for fast text extraction, pdfplumber for complex table structures, and a multi-phase OCR engine for scanned pages. The OCR pipeline goes well beyond a default Tesseract run: it includes grayscale conversion, sharpening, contrast enhancement, adaptive thresholding, and region detection that crops critical areas differently depending on image size and density. After OCR, I apply spell correction via pyspellchecker and an optional LLM filter to normalize domain-specific construction terminology. For AI extraction, I use a tiered model approach (GPT-4o for smaller files, GPT-4o-mini for large specification documents, and GPT-4o Vision for vector-based CAD drawings) with chunking that reduced API costs by roughly sixty percent. The evaluation engine is equally deliberate: it uses rapidfuzz for fuzzy matching, enforces more than twenty-five equipment tag patterns, applies critical keyword validation, and tracks quantity differences as a percentage deviation, classifying every item as matched, missing, or extra. The entire system runs behind a FastAPI backend with a React frontend, is containerized via Docker Compose, and uses Pydantic models for strict output validation. The trade-off was project volume: I fully processed three sample projects and two challenge projects end-to-end, prioritizing pipeline depth and evaluation rigor over shallow coverage of the full challenge set.
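To make the evaluation step in that draft concrete, here is a minimal sketch of classifying items as matched, missing, or extra and computing the percentage deviation in quantity. It is an illustration under assumptions, not the code in run.py: the names Comparison, similarity, and evaluate, the 85-point threshold, and the sample equipment names are all hypothetical, and the standard-library difflib stands in for rapidfuzz so the example runs without extra dependencies.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class Comparison:
    expected: Optional[str]             # name in the expected output, if any
    extracted: Optional[str]            # name found by the pipeline, if any
    status: str                         # "matched", "missing", or "extra"
    qty_deviation_pct: Optional[float] = None


def similarity(a: str, b: str) -> float:
    """Case-insensitive 0-100 score (difflib stand-in for rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(expected: Dict[str, float],
             extracted: Dict[str, float],
             threshold: float = 85.0) -> List[Comparison]:
    """Greedily pair each expected item with its most similar extracted item."""
    results: List[Comparison] = []
    unused = set(extracted)             # extracted names not yet paired
    for name, exp_qty in expected.items():
        best = max(unused, key=lambda cand: similarity(name, cand), default=None)
        if best is not None and similarity(name, best) >= threshold:
            unused.discard(best)
            # Signed deviation relative to the expected quantity.
            deviation = None
            if exp_qty != 0:
                deviation = 100.0 * (extracted[best] - exp_qty) / exp_qty
            results.append(Comparison(name, best, "matched", deviation))
        else:
            results.append(Comparison(name, None, "missing"))
    # Anything left over was extracted but never expected.
    for name in sorted(unused):
        results.append(Comparison(None, name, "extra"))
    return results


# Hypothetical sample data, not taken from TAKEOFF-56.
expected_items = {"Exhaust Fan EF-1": 2, "Rooftop Unit RTU-1": 1, "Water Heater WH-1": 1}
extracted_items = {"exhaust fan EF-1": 3, "Rooftop Unit RTU-1": 1, "Unit Heater UH-2": 4}

report = evaluate(expected_items, extracted_items)
for row in report:
    print(row)
```

Greedy best-match pairing is the simplest choice here; a real evaluator might instead score all pairs and solve an assignment problem, or require equipment tags such as EF-1 to match exactly before falling back to fuzzy names.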

01.05.2026 16:05

Yasser

01.05.2026 16:31