GovQuery: the Program Integrity Alliance's New Search Engine
Program Integrity Alliance

Why GovQuery
Government produces vast amounts of information, but much of it is scattered and hard to search. GovQuery, our custom search engine, brings together key sources in one place to make this knowledge easier to find and use. Coupled with AI capabilities, GovQuery becomes a platform for synthesizing resources across agencies and generating new insights. We started with sources central to our mission of integrity and accountability, and we continue to expand with materials that provide broader value for understanding how government works.
What AI needs
Generative AI is often used to find information in documents. A common pattern is Retrieval Augmented Generation (RAG): in its simplest form, a user asks a question, a search runs over the documents, and a Large Language Model (LLM) summarizes the results. Real systems can get far more complicated, but at the core there is one key component: search.
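To make the pattern concrete, here is a minimal sketch of the simplest form of RAG described above. It is an illustration only, not how GovQuery works: the three documents are invented, the search is a toy term-overlap ranking, and `llm` is a hypothetical callable standing in for whichever model is used.

```python
import re
from collections import Counter

# Toy corpus standing in for indexed documents (invented text, for illustration).
DOCUMENTS = {
    "report-a": "The audit found improper payments in the benefits program.",
    "report-b": "The agency updated its cybersecurity policy for contractors.",
    "report-c": "Improper payments rose because eligibility checks were skipped.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(question: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Search step: rank documents by how often they contain the question's terms."""
    terms = set(tokenize(question))
    scored = []
    for doc_id, text in DOCUMENTS.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, doc_id, text))
    # Highest score first; ties are broken by document id so results are stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, doc_id, text in scored[:top_k]]


def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    """Augmentation step: put the retrieved text in front of the question."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer(question: str, llm) -> str:
    """Generation step: `llm` is any callable that maps a prompt to a reply."""
    return llm(build_prompt(question, search(question)))
```

Whatever the model, it only ever sees what `search` returns, which is why the quality of the search step sets the ceiling for the answer.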
Any organization hoping to do fancy things with AI and their documents must be able to help the AI find information in those documents. That means searching not just document titles or summaries, but the detailed text and images within those documents, typically across multiple data sources.
Without a solid search foundation for knowledge retrieval tasks, AI will struggle to deliver.
Existing Search Tools
Single website search engines
Within the US Government Open Data space there are some amazing search tools available. Most websites have comprehensive search engines that let the user search the data on that website. For example, Oversight.gov and the Government Accountability Office (GAO) website let people search through document titles and summaries as well as oversight recommendations. They offer granular filtering and are a fantastic resource for users of those websites.
That said, these engines only work on the site they are designed for. If multiple sources of data need to be searched, the user must visit each site one by one. They also don't always offer in-document search, which matters when a single document contains several passages that match a user's question. Perhaps most importantly, they generally don't provide an API or other systematic way for an AI platform to use them directly.
Google and family
This is where giant search engines like Google can help. They crawl government websites and index PDF reports, so people can search those reports in their day-to-day search engine and benefit from advanced features like AI summarization, which has yet to appear on most government websites.
However, even giants like Google have limitations. Since they have to support the world, they tend to be more generalized and not tailored to the field or domain in question. They often don't offer features like linking to the page of a document where a search hit occurred. This can be a big deal, as reading through a 100-page PDF to find the part you're interested in isn't ideal.
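As a rough illustration of page-level linking, the sketch below searches text page by page and turns each hit into a link that opens the PDF at that page. It is a sketch under stated assumptions, not GovQuery's implementation: the function names and example URL are hypothetical, the matching is plain substring search, and the `#page=N` fragment is a convention honored by common PDF viewers rather than a guarantee.

```python
from dataclasses import dataclass


@dataclass
class PageHit:
    url: str      # address of the PDF
    page: int     # 1-based page number where the match occurred
    snippet: str  # a little text around the match


def search_pages(pages_by_url: dict[str, list[str]], query: str) -> list[PageHit]:
    """Return one hit per page whose text contains the query (case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return []
    hits = []
    for url, pages in pages_by_url.items():
        for number, text in enumerate(pages, start=1):
            position = text.lower().find(needle)
            if position != -1:
                start = max(0, position - 30)
                end = position + len(needle) + 30
                hits.append(PageHit(url, number, text[start:end].strip()))
    return hits


def deep_link(hit: PageHit) -> str:
    """Build a link that asks the PDF viewer to open at the matching page."""
    return f"{hit.url}#page={hit.page}"


# Hypothetical three-page report, already split into per-page text.
pages = {
    "https://example.gov/report.pdf": [
        "Introduction and scope of the review.",
        "Findings: improper payments increased year over year.",
        "Appendix: methodology.",
    ]
}
for hit in search_pages(pages, "improper payments"):
    print(deep_link(hit), "|", hit.snippet)
```

The key design choice is keeping the page number attached to the text at indexing time; once pages are merged into one blob, that information cannot be recovered at query time.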
GovQuery: more features, all in one place
For optimal AI performance, the ideal search solution should bring together the best parts of existing search engines into one platform for all of the required data sources. This is exactly what PIA has done as part of our wider AI solution development.
We first built an ingestion pipeline to collect publicly available text from a range of government agency websites, then used that text to build a production AI-powered search engine. There is a tendency in Generative AI to develop this last part in-house as a custom search engine, but at PIA we believe in building with scale in mind, so we opted to use Azure AI Search as a high-capacity, enterprise-grade search solution.
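One step that ingestion pipelines of this kind commonly include is splitting each document into smaller records before indexing, so a search hit points at a passage rather than a whole report. The sketch below shows that step in isolation. It is an assumption about a typical pipeline, not a description of PIA's: the field names (`id`, `source`, `url`, `title`, `chunk`, `content`) are hypothetical, and the upload to a search service such as Azure AI Search is deliberately left out.

```python
import hashlib


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that overlap, so a match near a boundary is not cut in half."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def to_search_records(source: str, url: str, title: str, text: str,
                      max_words: int = 200, overlap: int = 20) -> list[dict]:
    """Turn one document into per-chunk records ready to send to a search index."""
    records = []
    for position, chunk in enumerate(chunk_text(text, max_words, overlap)):
        # Deterministic id: re-ingesting the same document overwrites its
        # records instead of creating duplicates.
        record_id = hashlib.sha256(f"{url}#{position}".encode("utf-8")).hexdigest()[:16]
        records.append({
            "id": record_id,
            "source": source,   # e.g. the agency or site the text came from
            "url": url,
            "title": title,
            "chunk": position,  # 0-based position of the chunk in the document
            "content": chunk,
        })
    return records
```

Keeping `source` on every record is what later allows one index to serve several sites while still letting a user filter down to a single one.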
Comparing it against individual website searches and Google, we see that GovQuery offers more features in one place.
Choosing the data
Google does have one big advantage: it indexes most of the internet. PIA operates at a different scale, ingesting and indexing a subset of data for our search engine.
We started by prioritizing data sources that offer insights into program integrity and accountability. But as we built the search engine, we saw the broader value these sources provide in helping people understand how government operates. That perspective continues to guide our approach to expanding our data coverage, which currently includes the following sources (updated in near real-time):
- Department of Justice (DOJ): All 190,000 press releases as extracted from the DOJ's API.
- Congressional Research Service: All 22,000 reports as found on everycrsreport.com.
- Government Accountability Office: 10,500 reports published since 2010, and 5,200 open recommendations.
- Offices of Inspectors General: 22,500 federal reports via Oversight.gov, and 12,400 open recommendations.
Because some of the oversight recommendations we collect date from before 2010, we also selectively include the pre-2010 reports that contain them.
GovQuery as a public good
PIA originally built GovQuery to support its own AI projects. But because the search engine has useful features that the public cannot get for free elsewhere, PIA decided to make it available to everyone. By doing so, groups like researchers, journalists, oversight bodies, policy analysts, watchdogs, and students can now use the same advanced search capabilities that PIA relies on internally.
You can access GovQuery here.
If you are more technically inclined and building AI solutions, you can also use PIA's MCP server, which includes GovQuery tools that AI systems can use to find information. This server is also available on Docker MCP Hub.
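For readers unfamiliar with MCP (Model Context Protocol), the sketch below shows the general shape of the message an AI client sends to invoke a tool on any MCP server: a JSON-RPC 2.0 request with the method `tools/call`. The tool name `search_documents` and its `query` argument are hypothetical placeholders, not the names exposed by PIA's server; the real names and argument schemas are discovered by asking the server for its tool list (`tools/list`).

```python
import json
from itertools import count

# JSON-RPC requests carry an id so that replies can be matched to requests.
_request_ids = count(1)


def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and argument, for illustration only.
print(tool_call_request("search_documents", {"query": "improper payments"}))
```

In practice an MCP client library or an AI application handles this exchange, along with the initial handshake and the transport, so most users never write these messages by hand.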
As always, we'd love to hear from you so we can work together on building the best tools. Feel free to reach out to info@programintegrity.org for further information or if you'd like to get involved.