During the second half of 2020, a team of undergraduate and graduate students enrolled in the Advanced Data Science Laboratory course worked on some of the challenges facing Querido Diário, a project of the Data Science for Civic Innovation Program at Open Knowledge Brasil (OKBR) that seeks to open up the data in the official journals of Brazilian municipalities. The partnership between Open Knowledge Brasil and the Institute of Mathematics and Statistics of the University of São Paulo (IME-USP) ended in December 2020 with many advances for Querido Diário.
The team worked with a dataset of PDF documents collected between February 1st and June 15th using scrapers already developed for 391 municipalities. The project’s objective was to analyze the text of these documents to identify possibly suspicious purchases related to measures against the coronavirus pandemic. The dataset was therefore restricted to documents that cited at least one of eleven relevant terms, such as “bidding waiver”, “Personal Protective Equipment”, and “lung ventilators”.
The team faced many data analysis challenges, such as the unfavorable layout of most official journals. The texts are usually arranged in two columns, which makes automated interpretation difficult, because a program, by default, reads the text one line at a time. The lack of standardization among official journals is a problem that cuts across this and several other fronts of Querido Diário, such as the Census of Official Journals, a survey of the online availability of publications in each municipality.
The proposed approach to finding suspicious purchases was to extract the CNPJs (Brazilian company registration numbers) of companies that appear in the official journals as service providers to city halls, and then cross-reference them with external databases, such as the Registry of Disreputable and Suspended Companies and the records of campaign donations for the 2016 elections.
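The extraction and cross-referencing idea can be illustrated with a minimal sketch. This is not the team's actual code: the regular expression, the function names `extract_cnpjs` and `cross_reference`, and the sample text and registry are all assumptions made for illustration, and the CNPJs shown are made-up placeholders.

```python
import re

# Assumed pattern: 14 digits with the usual optional punctuation,
# e.g. 11.222.333/0001-81 or 11222333000181.
CNPJ_PATTERN = re.compile(r"\b\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}\b")


def extract_cnpjs(text):
    """Return the set of CNPJs found in the text, normalized to digits only."""
    return {re.sub(r"\D", "", match) for match in CNPJ_PATTERN.findall(text)}


def cross_reference(found, registry):
    """Return the CNPJs that appear both in the journals and in an external registry."""
    return found & registry


# Hypothetical excerpt of an official journal (placeholder CNPJs).
sample_text = (
    "Contratada: Empresa A Ltda, CNPJ 11.222.333/0001-81. "
    "Contratada: Empresa B Ltda, CNPJ 12345678000195."
)

# Hypothetical external registry, already normalized to digits only.
sample_registry = {"12345678000195", "99888777000100"}

found = extract_cnpjs(sample_text)
flagged = cross_reference(found, sample_registry)
```

Normalizing every CNPJ to its 14 digits before comparing means that differences in punctuation between the journals and the external databases do not hide a match.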
As a result, the group extracted 23,070 CNPJs, of which 1% were invalid. The most likely cause is typos that ended up recorded in an official document, which reveals a failure in the transparency of the information. Cross-referencing with the two databases turned up no record that could be considered suspicious.
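One common way to detect an invalid CNPJ is to recompute its two check digits. The sketch below assumes the numeric 14-digit format and the standard modulo-11 rule. The text does not say how the team judged validity, so this is only an illustration, and the names `check_digit` and `is_valid_cnpj` are hypothetical.

```python
import re


def check_digit(digits):
    """Compute one modulo-11 check digit for a string of digits.

    Weights run 2, 3, ..., 9 and then repeat, applied from right to left.
    """
    total = sum(int(d) * (2 + i % 8) for i, d in enumerate(reversed(digits)))
    remainder = total % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cnpj(raw):
    """Return True if the CNPJ has 14 digits and both check digits match."""
    digits = re.sub(r"\D", "", raw)
    # Reject wrong lengths and repeated-digit sequences such as 00000000000000,
    # which satisfy the arithmetic but are conventionally treated as invalid.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    first = check_digit(digits[:12])
    second = check_digit(digits[:12] + str(first))
    return digits[12:] == f"{first}{second}"
```

Because the last two digits are derived from the first twelve, a single mistyped digit always changes the expected check digits, which is why typos in a published document tend to surface as invalid CNPJs.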
All the code produced by the team, made up of Gabriel Trettel, George Othon, Tiago Lubiana, and Wesley Seidel, is freely available in this repository.
Credits: text written by Ariane Alves and adapted.