Making Federal Reports Usable for AI
Program Integrity Alliance

The United States (U.S.) government publishes millions of pages of reports annually—a cornerstone of democratic accountability and transparency. Yet despite being publicly available, much federal reporting data is not readily usable for modern analysis, including AI-driven research. Reports are often difficult to access in bulk, hard to analyze consistently, and challenging to reuse at scale. These limitations have real consequences: researchers spend months repeating data extraction work that has already been done elsewhere, journalists struggle to identify patterns across agencies, and insights from oversight bodies remain fragmented. When data that should inform accountability is locked in formats that are difficult for machines to interpret, everyone loses—including the agencies themselves, which could benefit from more efficient analysis of their own reporting.
At the Program Integrity Alliance (PIA), we aim to address these challenges by building public interest technology focused on making federal reports more usable for analysis and AI. Our work is fueled by one of the largest repositories of public documentation available, spanning millions of pages from thousands of government entities. We extract, standardize, and augment this data to produce free tools that help civil servants, researchers, journalists, and academics more easily explore and analyze government reports, including GovQuery, Recommendation Spotlight, and PIA Connect.
In the process of building these tools, we encountered recurring challenges that make federal reports difficult to use at scale—particularly for automated and AI-driven analysis. In this post, we share what we've learned, along with practical improvements that could make agency reports easier to access, interpret, and reuse. Making federal reports usable for AI is not just a technical concern; it is a necessary step toward more effective oversight, research, and accountability in government.
Websites Are Not Designed for Data Access
Most government reports are published on websites—and for good reason. These sites are designed for people to browse, search, and download individual reports, not for large-scale or automated data collection.
Because of this, extracting reports in bulk requires navigating complex, site-specific workflows. Each website is different, so customized processes must be built for every source. This creates substantial additional effort to develop automated downloads, as well as ongoing maintenance overhead. When a website's design changes—as happens frequently—these processes can break, making long-term automated access fragile and expensive.
At PIA, we take a careful and responsible approach to working with websites. We only extract data in bulk outside of business hours (typically on weekends), cache results to avoid repeat downloads, and throttle requests so that no more than one request is made every few seconds—never faster than a human could reasonably browse the site.
While this approach minimizes impact on public websites, it also introduces unavoidable delays. When working with tens of thousands of reports, these constraints significantly slow down data extraction timelines.
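The practices described above (off-hours scheduling, caching, and throttling) can be sketched roughly as follows. This is a minimal illustration rather than PIA's actual code: the `PoliteFetcher` class, the `is_off_hours` helper, the three-second default, and the weekend-only rule are all assumptions made for the example.

```python
import hashlib
import time
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional
from urllib.request import urlopen


def is_off_hours(now: datetime) -> bool:
    """Treat Saturday and Sunday as off-hours (an assumed policy)."""
    return now.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend


def _download(url: str) -> bytes:
    """Default downloader using only the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


class PoliteFetcher:
    """Fetch URLs with an on-disk cache and a minimum delay between requests."""

    def __init__(
        self,
        cache_dir: str,
        min_interval_s: float = 3.0,  # assumed value for "every few seconds"
        download: Callable[[str], bytes] = _download,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_interval_s = min_interval_s
        self.download = download
        self.clock = clock
        self.sleep = sleep
        self._last_request: Optional[float] = None

    def fetch(self, url: str) -> bytes:
        # Cached responses never touch the website again.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        cache_path = self.cache_dir / key
        if cache_path.exists():
            return cache_path.read_bytes()

        # Throttle: wait until min_interval_s has passed since the last request.
        if self._last_request is not None:
            wait = self.min_interval_s - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)

        data = self.download(url)
        self._last_request = self.clock()
        cache_path.write_bytes(data)
        return data
```

A scheduler would call `is_off_hours(datetime.now())` before starting a batch and then pass each report URL to `fetch`. The downloader, clock, and sleep function are injectable so the throttling logic can be tested without making real requests.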
At the Mercy of Website Search Engines
Another challenge is that on many agency websites, reports are only discoverable through the website search engine. Search engines are an important tool for people using the website, but for data access they add an extra layer of complexity. We have found that the number of reports returned in search result pages can differ from the totals presented on the website, which suggests an issue or bug somewhere in the search layer. Bugs are to be expected in any software, but this illustrates why it would be desirable to also be able to access report data without going through a search engine.
Metadata Exists — But Is Very Hard to Extract
Agencies have often done excellent work classifying documents into meaningful categories on source websites. Typically, these categories and tags appear as filters in the search engine, but we found in many cases that they are not displayed on the actual report pages. It's great to be able to filter reports in a search engine by federal agency, but it makes life difficult if the report pages themselves don't consistently mention the agency. The metadata then has to be inferred by running a filtered search query for each category and recording which reports come back, which is a cumbersome and time-consuming process.
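The inference step can be sketched as follows: run one search per filter value, then invert the results so that each report URL maps to the filter values that returned it. The function name and the `search` callable are hypothetical; in practice `search` would wrap whatever site-specific query logic a given website needs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def infer_tags_from_filters(
    filter_values: Iterable[str],
    search: Callable[[str], Iterable[str]],
) -> Dict[str, Set[str]]:
    """Map each report URL to the set of filter values whose results include it."""
    tags: Dict[str, Set[str]] = defaultdict(set)
    for value in filter_values:
        # One (possibly paginated) search per filter value, e.g. per agency.
        for report_url in search(value):
            tags[report_url].add(value)
    return dict(tags)
```

Note that this needs at least one search per filter value, on top of the throttling described earlier, which is why the process is slow when a site has many categories.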
Federal Agency Names Are Inconsistent
Another discovery was that different websites use different names and hierarchies for federal agencies. Agency naming conventions can even differ between reports on the same website. There appears to be no standard or central list of agencies in use on the websites we analyzed. This may sound trivial, but with over 1,500 federal agencies and their entities, the lack of a shared standard adds a huge amount of work for anyone who wants to analyze reports by agency or combine them with other data sources.
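To give a sense of the extra work involved, here is a minimal sketch of the kind of name normalization that becomes necessary. The alias table, filler-word list, and function names are illustrative assumptions, not a description of our pipeline; a real table would need to cover far more variants.

```python
import re
from typing import Dict, Iterable, Mapping, Optional

# Words that carry no identifying information in an agency name (assumed list).
_FILLER = {"the", "of", "and", "for", "u", "s", "us", "united", "states"}
_ABBREVIATIONS = {"dept": "department"}


def name_key(name: str) -> str:
    """Reduce a name to a comparison key: lowercase, no punctuation, no filler."""
    text = name.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    words = (_ABBREVIATIONS.get(word, word) for word in text.split())
    return " ".join(word for word in words if word not in _FILLER)


def build_index(aliases: Mapping[str, Iterable[str]]) -> Dict[str, str]:
    """Build a lookup from every known variant's key to its canonical name."""
    index: Dict[str, str] = {}
    for canonical, variants in aliases.items():
        for name in (canonical, *variants):
            index[name_key(name)] = canonical
    return index


def normalize_agency(name: str, index: Mapping[str, str]) -> Optional[str]:
    """Return the canonical agency name, or None if the variant is unknown."""
    return index.get(name_key(name))


# Illustrative alias table; a real one would be much larger.
INDEX = build_index({
    "Department of Defense": ["Defense Department", "DoD"],
    "Department of Health and Human Services": ["HHS", "Health & Human Services"],
})
```

With this table, `normalize_agency("U.S. Dept. of Defense", INDEX)` and `normalize_agency("DoD", INDEX)` both resolve to "Department of Defense", while an unrecognized name returns `None` and has to be reviewed by hand.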
Documents Are AI-Unfriendly
We found that it was often difficult to extract information from PDF reports when trying to use AI for analysis. Some reports are published as 'scanned' PDFs, meaning they are composed of images of text pages. We expect this with older documents, but some new reports are still being published in this format. This is a problem because images of text must be converted to machine-readable text using Optical Character Recognition (OCR), which is complicated, slow, potentially expensive, error-prone, and difficult to scale.
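One rough way to spot scanned PDFs before sending them to OCR is to check how much text a PDF library can extract from each page. The sketch below assumes the per-page text has already been extracted (for example with pypdf, via `PdfReader(path).pages` and `page.extract_text()`); the function name and both thresholds are assumptions chosen for illustration, not tuned values.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    min_chars_per_page: int = 50,      # assumed: fewer characters counts as "no text"
    max_empty_fraction: float = 0.5,   # assumed: more than half empty pages means scanned
) -> bool:
    """Guess whether a PDF is image-only from the text extracted from each page."""
    if not page_texts:
        return False  # nothing to judge
    empty_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > max_empty_fraction
```

This is only a heuristic: a scanned PDF that already carries a hidden OCR text layer will look like a text-based PDF, and a text-based report made up mostly of charts may be flagged incorrectly.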
Even text-based PDFs present a challenge. AI analysis requires clean, structured text, and one of the most surprising obstacles to producing it is parsing PDF documents. It sounds simple, but because document layouts and standards vary widely, the extracted text can be noisy and difficult for AI to interpret.
In practice, many of the challenges attributed to "AI limitations" are actually data-quality and document-structure problems that could be addressed at publication time.
Suggested Improvements
The following changes could have a significant impact and reduce the barriers to using agency reports and data:
Provide a manifest file on websites that lists all available reports
One simple change that would immediately improve data extraction would be to provide a manifest file (or files) on the website listing all available reports, their website location, and associated metadata. This data likely already exists in the databases that power site search engines and should therefore be relatively straightforward to automatically extract. A reports list file would eliminate the need to use website search engines or site navigation to find reports, vastly simplifying access to open report data.
This would be a low-effort first step agencies could take in order to make their document data more readily accessible for AI-powered analysis.
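As an illustration of how little is needed, here is a sketch of what a manifest and its consumer might look like. The JSON layout, the field names (`title`, `url`, `agency`, `published`), and the example.gov URLs are all hypothetical; any consistent, documented format would serve the same purpose.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical manifest an agency might publish at a fixed, well-known URL.
EXAMPLE_MANIFEST = """
{
  "reports": [
    {
      "title": "Example Audit Report",
      "url": "https://www.example.gov/reports/example-audit.pdf",
      "agency": "Example Agency",
      "published": "2024-01-15"
    },
    {
      "title": "Example Inspection Report",
      "url": "https://www.example.gov/reports/example-inspection.pdf",
      "agency": "Another Example Agency",
      "published": "2024-02-01"
    }
  ]
}
"""


@dataclass(frozen=True)
class ReportEntry:
    title: str
    url: str
    agency: str
    published: str  # ISO 8601 date string


def parse_manifest(text: str) -> List[ReportEntry]:
    """Turn manifest JSON into a list of report entries."""
    payload = json.loads(text)
    return [
        ReportEntry(
            title=item["title"],
            url=item["url"],
            agency=item["agency"],
            published=item["published"],
        )
        for item in payload["reports"]
    ]


entries = parse_manifest(EXAMPLE_MANIFEST)
```

With a file like this, finding every report and its metadata takes one download and a few lines of parsing, instead of a custom crawler for each site.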
Better still, provide Application Programming Interfaces (APIs)
As noted above, downloading data from websites is brittle, expensive, slow, and potentially inaccurate. The standard for providing data is to use APIs. These would result in fewer queries on websites, but most importantly, provide a more robust and consistent way to extract data. Not only would this provide teams with a way to extract data in bulk, but APIs are also well-positioned for use in AI tools.
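To show why an API is more robust for bulk access, here is a sketch of a client walking a paginated reports endpoint. The response shape (a `results` list plus a `next_page` number that is absent or null on the last page) is an assumption made for the example and does not describe any existing agency API.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_reports(
    fetch_page: Callable[[int], Page],
    start_page: int = 1,
) -> Iterator[Dict[str, Any]]:
    """Yield every report record by following the API's pagination."""
    page_number: Optional[int] = start_page
    while page_number is not None:
        page = fetch_page(page_number)       # one structured request per page
        yield from page["results"]
        page_number = page.get("next_page")  # missing or None ends the loop
```

Here `fetch_page` stands in for an HTTP call that returns parsed JSON. Because the response is structured data rather than rendered HTML, a website redesign would not break the client.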
Save Report PDFs to be more AI-friendly
At a minimum, agencies can dramatically improve AI-readiness by following a few core principles when publishing PDFs.
Practices to avoid:
- ❌ Printing to PDF
- ❌ Scanning paper documents instead of exporting digitally
- ❌ Using layout for meaning (e.g. relying on indentation to signal heading level)
- ❌ Using images of text
- ❌ Using tabs/spaces for alignment
- ❌ Using floating text frames for main content
- ❌ Manually drawn tables or separators
Before Exporting:
- Use built-in styles for structure
- Use Heading 1–6 styles for section headings (never font size or bold alone)
- Use normal paragraph styles for body text
- Use real lists (numbered/bulleted list tools; never fake lists with typed hyphens, manually inserted bullet characters, or spacing)
- Use real tables (insert tables using the table tool; mark header rows/columns; never create tables using tabs, spaces, or drawn lines)
- Maintain logical reading order (keep content in a simple, top-to-bottom flow; avoid multi-column layouts unless absolutely necessary)
- Avoid text boxes and floating elements (do not place main content in text boxes; set images to In Line with Text)
- Add alt text to non-decorative images (describe meaning, not appearance; mark purely decorative images as decorative)
- Use Unicode fonts (avoid custom or symbol fonts; avoid ligatures for body text)
- Do not outline or flatten text (text must remain selectable and searchable)
When Exporting:
- Use "Save As / Export → PDF" (always export directly from the authoring tool)
- Enable accessibility / tagging options (select "Best for electronic distribution and accessibility" if available; enable "Create bookmarks from headings" if available)
- Preserve text and structure (do not rasterize pages; do not flatten transparency if avoidable)
For readers interested in standards, Tagged PDF is defined in ISO 32000, and its practical application for accessibility is specified in PDF/UA (ISO 14289). The Library of Congress has adopted these standards as part of its digital preservation and accessibility guidance.
Use standard naming conventions
Slight variations in how fields such as agency and program are named add significant overhead to downstream report analysis. If agencies used a standard list (for example, we use the USAspending agency codes maintained by the Treasury Department), it would be easier to combine reports and datasets.
None of these changes require reinventing federal reporting systems. Small, incremental improvements, such as publishing report inventories, standardizing agency identifiers, and exporting structured PDFs, would significantly reduce barriers for oversight, research, and AI-driven analysis, benefiting both the public and federal agencies themselves.