Most commercial lease data lives in PDFs. Most lease analysis happens in Excel. Bridging that gap cleanly is one of the most common operational problems in commercial real estate, and most teams solve it badly.
Copy-paste is the instinctive approach. It also fails in predictable ways. Understanding why it fails -- and what the correct extraction architecture looks like -- determines whether your lease database is actually trustworthy.
Why Copy-Paste Fails
PDF is not a document format in the way Word or HTML is. It is a rendering format: the file describes where each character appears on a page, not the semantic structure of the content. A lease PDF that looks perfectly readable on screen may have underlying text encoded in non-sequential order, split across invisible text boxes, or layered on top of a scanned image with no machine-readable text at all.
When you copy text from a PDF lease and paste it into Excel, you get whatever the PDF renderer decides to extract. Multi-column layouts get linearized incorrectly. Numbers stored as individual characters get concatenated wrong. Tables lose their row-column alignment. Headers run together with body text. Rental schedules -- which are almost always formatted as tables -- are particularly prone to corruption.
Scanned leases are worse. Older leases executed before digital workflows became standard are often scanned images, not text PDFs. Copy-paste returns nothing, or garbage characters from noise in the scan.
Even when copy-paste works mechanically, it does not extract structured data. It copies text. Someone still has to read that text, identify which clause contains the rent commencement date, and type the value into the right cell. That is not data extraction -- it is manual re-keying with extra steps.
What a Proper Lease Spreadsheet Contains
A lease database that supports portfolio management needs a specific set of fields organized into categories. Trying to build this incrementally from ad-hoc copy-paste produces inconsistent coverage and unreliable values.
A complete lease abstract covers 14 categories:
Parties. Landlord legal name, tenant legal name, guarantor name, landlord notice address, tenant notice address.
Premises. Property address, suite or unit number, rentable square footage, usable square footage, common area factor, floor level.
Term. Lease commencement date, rent commencement date, expiration date, original lease term in months.
Base rent. Rent schedule (each period, rate, and escalation), current monthly rent, annual rent, rent per square foot, free rent periods.
Escalations. Escalation type (fixed percentage, CPI, fair market value), escalation frequency, fixed percentage rate, CPI index and cap.
Operating expenses. Lease structure (NNN, gross, modified gross), CAM inclusion list, CAM exclusion list, base year, expense stop, controllable expense cap, management fee cap.
Security deposit. Deposit amount, deposit form (cash, letter of credit), burn-down schedule, return conditions.
Options. Renewal option count and terms, renewal notice deadline, expansion option, contraction option, termination option.
Assignment and subletting. Assignment rights, subletting rights, recapture right, profit sharing on sublease.
Insurance. General liability minimum, property insurance requirements, additional insured requirement, certificate delivery deadline.
Use and exclusivity. Permitted use language, exclusivity clause, co-tenancy requirement.
Improvements. Tenant improvement allowance amount, delivery condition, landlord's work scope, TI deadline.
Special provisions. Right of first refusal, right of first offer, signage rights, parking spaces and ratio.
Governing law and notices. Jurisdiction, notice method, notice addresses, cure periods.
This is not exhaustive, but it covers the fields that drive operating decisions: budgeting, lease administration, CAM reconciliation, renewal planning, and portfolio reporting.
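One way to keep coverage consistent is to define the schema in code before extracting anything. The following is a minimal sketch, not a complete schema: it models only three of the 14 categories, and the class names, field names, and the `missing_fields` helper are illustrative assumptions rather than part of any particular product.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Parties:
    landlord_legal_name: Optional[str] = None
    tenant_legal_name: Optional[str] = None
    guarantor_name: Optional[str] = None


@dataclass
class Premises:
    property_address: Optional[str] = None
    suite: Optional[str] = None
    rentable_sf: Optional[float] = None
    usable_sf: Optional[float] = None


@dataclass
class Term:
    lease_commencement_date: Optional[date] = None
    rent_commencement_date: Optional[date] = None
    expiration_date: Optional[date] = None
    original_term_months: Optional[int] = None


@dataclass
class LeaseAbstract:
    # Hypothetical top-level record: one instance per lease.
    lease_id: str
    parties: Parties = field(default_factory=Parties)
    premises: Premises = field(default_factory=Premises)
    term: Term = field(default_factory=Term)


def missing_fields(abstract: LeaseAbstract) -> List[str]:
    """Return 'category.field' names whose value is still None."""
    missing = []
    for category in fields(abstract):
        value = getattr(abstract, category.name)
        if not is_dataclass(value):
            continue  # skip plain attributes such as lease_id
        for f in fields(value):
            if getattr(value, f.name) is None:
                missing.append(f"{category.name}.{f.name}")
    return missing


lease = LeaseAbstract(lease_id="L-001")
lease.parties.tenant_legal_name = "Example Tenant LLC"
lease.term.expiration_date = date(2030, 12, 31)
print(missing_fields(lease))  # lists every field not yet populated
```

A fixed schema makes gaps visible: a field that was never extracted shows up as missing instead of silently not existing in the spreadsheet.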
How AI Extraction Works
The correct technical architecture for lease PDF-to-Excel conversion is a three-stage pipeline.
Stage 1: OCR. Before any text can be processed, it must be reliably extracted from the PDF. Amazon Textract (AWS) is a widely used managed service for this -- it is designed to handle scanned documents, multi-column layouts, tables, and mixed handwritten/printed text, which generic PDF text-extraction libraries often handle poorly. OCR output is raw text with positional information: the system knows where each word appears on the page.
Stage 2: LLM extraction. A large language model trained or prompted for legal document understanding reads the OCR output and extracts specific fields. This is where the semantic work happens: identifying which clause governs, resolving references ("as defined in Section 4.2(b)"), handling ambiguous language, and flagging low-confidence extractions. A well-prompted language model can handle the variation in how different leases express the same concept -- "net rentable area" vs. "rentable square footage" vs. "RSF" all map to the same field.
Stage 3: Structured output. The extracted values are written to a standardized schema and exported to Excel. One row per lease, one column per field. Each field value includes a confidence score so you know which fields were extracted cleanly and which need manual review.
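The hand-off from Stage 2 to Stage 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any specific product: the alias table, the `ExtractedField` shape, and the 0.85 review threshold are all hypothetical, and the OCR and LLM calls themselves are left out.

```python
import re
from typing import Dict, List, NamedTuple, Optional

# Hypothetical alias table: different lease wording, one canonical column.
FIELD_ALIASES = {
    "net rentable area": "rentable_sf",
    "rentable square footage": "rentable_sf",
    "rsf": "rentable_sf",
}

REVIEW_THRESHOLD = 0.85  # assumed cutoff; tune it against manually reviewed samples


class ExtractedField(NamedTuple):
    label: str         # label as reported by the extraction step
    value: str
    confidence: float  # 0.0 to 1.0


def canonical_name(label: str) -> Optional[str]:
    """Map a raw label to a canonical column name, or None if it is unknown."""
    key = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
    return FIELD_ALIASES.get(key)


def build_row(extracted: List[ExtractedField]) -> Dict[str, object]:
    """Flatten extracted fields into one row: value, confidence, review flag."""
    row: Dict[str, object] = {}
    for item in extracted:
        name = canonical_name(item.label)
        if name is None:
            continue  # in practice, unknown labels would go to a review queue
        row[name] = item.value
        row[f"{name}_confidence"] = item.confidence
        row[f"{name}_needs_review"] = item.confidence < REVIEW_THRESHOLD
    return row


row = build_row([ExtractedField("Net Rentable Area", "12,500", 0.62)])
print(row)
```

Keeping the confidence and review flag in columns next to each value means a reviewer can filter the spreadsheet down to the cells that need attention.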
Lextract runs this pipeline end-to-end: upload a lease PDF, receive an Excel file with 125+ structured fields and confidence scores in under 5 minutes. The output schema is consistent across every lease in your portfolio.
What the Excel Output Should Look Like
A lease export designed for portfolio management has one row per lease and one column per field. Escalation schedules are the exception -- they have multiple periods per lease and belong in a separate sheet (or a separate table in a relational database) with a foreign key back to the lease record.
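The one-row-per-lease table plus a separate schedule table might look like the sketch below. It uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample figures are assumptions chosen for illustration. The same two-table shape applies to two Excel sheets linked by a lease ID column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE leases (
    lease_id          TEXT PRIMARY KEY,
    tenant_legal_name TEXT NOT NULL,
    expiration_date   TEXT
);
CREATE TABLE rent_schedule (
    lease_id     TEXT NOT NULL REFERENCES leases(lease_id),
    period_start TEXT NOT NULL,   -- ISO dates sort correctly as text
    period_end   TEXT NOT NULL,
    monthly_rent REAL NOT NULL,
    PRIMARY KEY (lease_id, period_start)
);
""")

conn.execute(
    "INSERT INTO leases VALUES (?, ?, ?)",
    ("L-001", "Example Tenant LLC", "2029-12-31"),
)
conn.executemany(
    "INSERT INTO rent_schedule VALUES (?, ?, ?, ?)",
    [
        ("L-001", "2027-01-01", "2027-12-31", 10000.0),
        ("L-001", "2028-01-01", "2028-12-31", 10300.0),
        ("L-001", "2029-01-01", "2029-12-31", 10609.0),
    ],
)


def rent_on(conn, lease_id, as_of):
    """Return the monthly rent in effect on an ISO date, or None if out of term."""
    row = conn.execute(
        "SELECT monthly_rent FROM rent_schedule "
        "WHERE lease_id = ? AND period_start <= ? AND period_end >= ?",
        (lease_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


print(rent_on(conn, "L-001", "2028-06-15"))
```

With the foreign key enabled, a schedule row that points at a lease ID not present in the leases table is rejected, which catches orphaned periods early.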
Column naming matters for usability. Fields like rent_commencement_date, expiration_date, and cam_cap_pct are more useful than Date 1, Date 2, and CAM because they survive personnel turnover without explanation.
Date fields should be stored as actual Excel dates, not text strings. "March 15, 2027", "3/15/27", and "15-Mar-27" all mean the same thing, but as text they require different handling in formulas. Storing every date as a true date value, displayed in a consistent ISO format (2027-03-15), eliminates an entire category of downstream error.
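A small normalizer can turn the date spellings above into real date values before they are written to a spreadsheet. This is a sketch under assumptions: the list of accepted formats is illustrative, slash-separated dates are read month-first (US convention), and `parse_lease_date` is a hypothetical helper name.

```python
from datetime import date, datetime

# Assumed input formats; extend the tuple for formats seen in your own leases.
# Slash dates are treated as month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%y", "%m/%d/%Y", "%d-%b-%y")


def parse_lease_date(text: str) -> date:
    """Parse a date string in any known format into a datetime.date."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {text!r}")


for raw in ("March 15, 2027", "3/15/27", "15-Mar-27"):
    print(raw, "->", parse_lease_date(raw).isoformat())
```

Raising on an unrecognized format, rather than passing the text through, keeps unparsed strings from ending up in a date column unnoticed.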
Common Mistakes in Lease Data Extraction
Capturing rent as a single number. A lease with a fixed annual escalation has a different rent value every year. Storing only the current rent and not the escalation schedule means your model goes wrong at the first escalation date and stays wrong for the rest of the term.
Losing amendment data. When a lease has been amended, the original terms may have been superseded. An extraction that processes only the base lease document will contain stale data on amended fields. Each amendment must be processed and overlaid onto the base record.
Ignoring confidence scores. An AI extraction that returns every field without indicating uncertainty is more dangerous than one that flags low-confidence values. Confidently wrong data is worse than acknowledged gaps.
Not normalizing square footage. Rentable square footage and usable square footage are not the same, and the measurement standard applied (for example, which BOMA standard) can change either figure. Mixing them in the same column produces rent-per-square-foot figures that are not comparable across properties.
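The first two mistakes above are cheap to avoid in code. The sketch below is a simplified illustration: it assumes a fixed percentage applied once per lease year, and an amendment record shaped as an ISO `effective_date` plus a `changes` dictionary. Both function names and the record shape are hypothetical.

```python
from typing import Any, Dict, List


def fixed_escalation_schedule(base_annual_rent: float, pct: float, years: int) -> List[float]:
    """Annual rent for each lease year, compounding a fixed percentage each year."""
    return [round(base_annual_rent * (1 + pct) ** year, 2) for year in range(years)]


def apply_amendments(base: Dict[str, Any], amendments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Overlay amendments oldest first, so later documents supersede earlier terms."""
    current = dict(base)  # copy, so the base lease record is left untouched
    for amendment in sorted(amendments, key=lambda a: a["effective_date"]):
        current.update(amendment["changes"])
    return current


# Year 1 at 120,000 with a 3% fixed annual escalation.
print(fixed_escalation_schedule(120000, 0.03, 3))  # [120000.0, 123600.0, 127308.0]

base_lease = {"expiration_date": "2029-12-31", "rentable_sf": 12500}
amendments = [
    {"effective_date": "2028-06-01", "changes": {"expiration_date": "2034-12-31"}},
    {"effective_date": "2026-01-01", "changes": {"rentable_sf": 14000}},
]
print(apply_amendments(base_lease, amendments))
```

Sorting by effective date before overlaying matters: processing amendments in the order the files happened to be uploaded can let an older amendment overwrite a newer one.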
A properly structured lease Excel database -- built from full 126-field extractions rather than manual copy-paste -- is the foundation for accurate portfolio budgeting, compliance tracking, and lease administration. Building it correctly the first time is significantly cheaper than correcting it later.