Paris School of Economics - École d'Économie de Paris

Economics serving society

Organizing data integration

A major issue for such a long-run quantitative and qualitative database is optimizing the data capture processes. To make data entry effective, the Equipex applies two data capture technologies, both of which require scanning the printed sources. The first is IT-organized manual entry, within an ad hoc environment, of the data published on the official lists of the Paris Stock Exchange until 1950. The second is the semi-automatic processing of the stock exchange yearbooks by specific software based on optical character recognition (OCR) and artificial intelligence. Whether the official lists for the post-1950 period can be processed by similar software is under study, since the tests run on the interwar period were unsatisfactory.
Optimizing the two technologies while respecting the historical sources first requires implementing the “informational architecture” within the database.

1) Implementation of the “informational architecture” within the database.

The informational architecture of the database is given by the time-dependent structure of the official lists, the issuers of listed securities, and the links between securities and issuers. To build it, the tree structure of the official lists’ sectors was entered first. For each sector, as it is named on the official list, the precise dates on which it appears and disappears were recorded. This allows the varying tree structure of the official list to be replicated at any point in time.
Once the sectors had been created in the database, the names of all the securities, as published on the official lists, were recorded and placed in the right sectors, taking the time dimension into account, i.e. precisely identifying the IPO and delisting dates. In this way, a precise virtual version of the official list, with all the securities classified by sector, can be reconstructed within the database for any point in time.
In parallel, the names of the issuers of the recorded securities were created within the database as they were published in the yearbooks. The names of both securities and issuers were entered through specific data entry masks (written in Java) that automatically create the links between an issuer and the securities it issued.
The names of both securities and issuers were recorded as they are published in the sources, in order to optimize the two data entry technologies (see below). However, the same security or issuer can take various names over time (within the D-FIH database, these names are called “stock_name” and “corporation_name” respectively). For example, the government bond “Roumanie 5%” appears as “Roumanie 5%, Oblig. de 500 fr.” until 1894 and as “Romanie 5%, Oblig de 25 fr. de rente” after 1894. On the issuer side, “Banque parisienne” changed its name to “Banque de l’union parisienne” in 1904. This is why the database was designed to link the different names that the same security or issuer can take to a single entity (within the D-FIH database, these entities are called “stock” and “corporation” respectively). As a consequence, the database can be queried along both its “historical” and “financial” dimensions.
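The reconstruction of the list at a given date can be illustrated with a minimal Python sketch. It is not the D-FIH schema: the classes `Sector` and `Listing`, their field names, and the function `official_list_on` are hypothetical, and the sector tree is flattened to a single level for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Sector:
    name: str                     # sector name as printed on the official list
    first_seen: date              # date on which the sector appears
    last_seen: Optional[date]     # None: no disappearance recorded


@dataclass(frozen=True)
class Listing:
    stock_name: str               # security name as printed on the official list
    sector: str                   # name of the sector it is classified under
    listed_on: date               # IPO (first listing) date
    delisted_on: Optional[date]   # None: no delisting recorded


def _active(start, end, day):
    """True if `day` falls within [start, end], with an open end if `end` is None."""
    return start <= day and (end is None or day <= end)


def official_list_on(day, sectors, listings):
    """Return {sector name: [security names]} as the list stood on `day`."""
    snapshot = {
        s.name: [] for s in sectors if _active(s.first_seen, s.last_seen, day)
    }
    for item in listings:
        if item.sector in snapshot and _active(item.listed_on, item.delisted_on, day):
            snapshot[item.sector].append(item.stock_name)
    return snapshot
```

Storing validity dates on both sectors and listings is what lets a single query return the list as it stood on any day, rather than keeping one copy of the list per day.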
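The link between successive names and a single entity can be sketched as follows in Python. This is an illustration under stated assumptions, not the actual D-FIH data model: the class `CorporationName`, the identifier value 1, the year-level granularity, and the choice to apply the new name from 1904 onward are all hypothetical. Only the two names and the year of the change come from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorporationName:
    corporation_id: int          # the stable "corporation" entity (hypothetical id)
    name: str                    # a "corporation_name" as printed in the source
    from_year: Optional[int]     # None: no known start
    to_year: Optional[int]       # None: no known end


# Assumption: the old name is used through 1903 and the new one from 1904.
NAMES = [
    CorporationName(1, "Banque parisienne", None, 1903),
    CorporationName(1, "Banque de l'union parisienne", 1904, None),
]


def resolve(name, names=NAMES):
    """Map a name as printed in the source to its stable entity id."""
    for record in names:
        if record.name == name:
            return record.corporation_id
    raise KeyError(name)


def name_in_year(corporation_id, year, names=NAMES):
    """Return the name under which the entity appears in a given year."""
    for record in names:
        starts_ok = record.from_year is None or record.from_year <= year
        ends_ok = record.to_year is None or year <= record.to_year
        if record.corporation_id == corporation_id and starts_ok and ends_ok:
            return record.name
    raise KeyError((corporation_id, year))
```

Resolving a printed name to the entity supports “financial” queries across the whole life of the issuer, while looking up the name in force in a given year supports “historical” queries that follow the sources.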

2) Data entry

For data entry, we employed two technologies: IT-organized manual data entry on the one hand and, on the other, a semi-automatic process based on specific software.
The IT-organized data entry covers at least the official lists up to 1950. It has been outsourced to a specialized social firm. To target an accuracy rate of 99.99%, a “double data entry” procedure was adopted: the same data are entered by two operators and, where their entries differ, a third person checks the data. The collaboration with this firm began by adapting the logical data model to the specificities of its production processes and the resulting interactions. Teams of operators were specialized by type of data (e.g. cash prices, forward prices, dividends) in order to ensure satisfactory productivity. For each data type, the Equipex implemented simplified input masks containing only the fields that correspond to a given subset of the data published on the official lists of a given timespan. Thanks to the informational structure of the database, each operator can select, from the database’s tree of the list for a given day, the security whose name exactly matches the one she finds on the official list, at the same level of the tree. The operator can then easily match the names of the securities as they appear in the database and on the official list, and enter the corresponding data read from the signposted image of the list embedded in the mask. This process minimizes the risk of misidentifying a security. Before starting the data entry, the operators received specific training from members of the project on how to read the official lists. To give the operators further support in reading the lists, an off-line website was set up where teams can exchange and where questions and answers are stored.
The semi-automatic data entry relies on specific software developed in cooperation with a service provider. The software combines OCR techniques and artificial intelligence to learn from its errors and insert the data into the database automatically. It is used to capture and enter into the database all the information collected from the yearbooks. The procedure involves several steps. After the sources were scanned, lexical dictionaries were built to help the OCR software read the yearbooks, and a set of rules for the effective optical recognition of the information, its processing, and its insertion into the database was written by the project members and coded by the service provider’s engineers. A system of interactions was then implemented that allows trained research assistants, supervised by experts in the field, to modify, correct, and complete the information collected by the OCR technology. Because the name of the issuer stored in the database is matched with the one published in the yearbooks, the data can be imported into the database automatically and the pertinent links among the data created.
Whether the official lists from 1950 to 1976 can be processed with similar software is under study. The informational architecture facilitates data entry and allows prices to be inserted for every security at any frequency. Since daily frequency was too costly for the project’s budget, the Scientific Council opted to enter prices twice a month: on the 15th and on the last day of each month. Given the budgetary constraints, this choice was based on the functioning of the exchange. For most of the period covered by the database, derivatives (forward operations and options) made up the bulk of transactions, and the exchange organized the delivery and settlement of these transactions twice a month, on the 15th and on the last day of the month.
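The comparison step of the double data entry can be sketched in a few lines of Python. This is only an illustration: the function `compare_entries` and the record key (security name, date, data type) are hypothetical, and values are compared as typed strings so that no rounding masks a keying difference.

```python
def compare_entries(first, second):
    """Compare two operators' entries for the same batch.

    Both arguments map a record key to the value typed by one operator.
    Returns (agreed, disputed): values on which both operators agree, and
    the keys that must be sent to a third person for checking.
    """
    agreed, disputed = {}, []
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        if a is not None and a == b:
            agreed[key] = a
        else:
            # Differing values, or a record keyed by only one operator.
            disputed.append(key)
    return agreed, disputed
```

A record missing from one operator’s file is treated as a disagreement, so an omission is reviewed in the same way as a mistyped value.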
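The capture calendar described above (the 15th and the last day of each month) can be generated with the Python standard library. The function name `capture_dates` is hypothetical, and the sketch does not address what happens when one of these dates falls on a day the exchange was closed, which the text does not specify.

```python
import calendar
from datetime import date


def capture_dates(year):
    """Return the 24 target dates of a year: the 15th and the last day of each month."""
    days = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        last_day = calendar.monthrange(year, month)[1]
        days.append(date(year, month, 15))
        days.append(date(year, month, last_day))
    return days
```

Using `calendar.monthrange` handles month lengths and leap years, so February ends on the 28th in 1900 and on the 29th in 1904.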

3) Check and validation of the data

The checking and validation of the data depend on the technology used for the capture. For the IT-organized data entry, once the operators have completed the data entry, the Equipex proceeds internally with the checking, validation, and insertion of the data into the database. First, the Equipex IT team checks the SQL file containing the data from an IT point of view, in order to uncover possible problems that would prevent the data from being inserted correctly into the database. Then, the data series are checked automatically to detect “anomalies” in the sequences. For example, this operation highlights “abnormal” returns, that is to say returns exceeding a threshold that can be set according to the historical period; the prices behind these returns are systematically checked and corrected when wrong. The last step is carried out on a sample of at least 10% of the data received. The sample is fully checked and its accuracy rate calculated. If the whole sample of checked data achieves a 99.99% accuracy rate, the data are validated and inserted into the database. If the accuracy rate is lower, the data are rejected and the specialized service provider must supply another dataset, which is checked once again following the same three steps.
The data captured by the specific software undergo automatic coherence checks that are embedded in the software and exploit the redundancy of the information. Every year, the yearbooks publish a notice on each listed issuer. The notices on a given issuer change over time from one yearbook to the next, but a significant set of information remains stable, and the rest is republished unchanged in at least three or four successive yearbooks. For example, the date of foundation of a corporation is published in every yearbook and must evidently be the same. If the software reads “20/09/1905” as the date of foundation of a given corporation in yearbook N but reads “20/09/1903” in yearbook N+1, it sends an alert. An operator then checks the images of the yearbooks, which are embedded within the software, and corrects the value. Beyond these automatic coherence checks, additional manual checks are run on a sample of at least 5% of each category of data captured by the software for each yearbook.
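The second and third steps can be illustrated with a short Python sketch. It rests on assumptions that the text does not spell out: returns are computed as simple period-to-period price changes, the threshold is a free parameter, and the acceptance rule is expressed as at most one error per 10,000 checked values. The function names are hypothetical.

```python
def flag_abnormal_returns(prices, threshold):
    """Return the positions whose simple return versus the previous price
    exceeds `threshold` in absolute value (e.g. 0.5 for 50%)."""
    flagged = []
    for i in range(1, len(prices)):
        previous, current = prices[i - 1], prices[i]
        if previous > 0 and abs(current / previous - 1.0) > threshold:
            flagged.append(i)
    return flagged


def batch_is_accepted(sample_size, errors_found, max_errors_per_10000=1):
    """Accept the batch if the checked sample reaches the required accuracy.

    99.99% accuracy means at most 1 error per 10,000 values; integer
    arithmetic avoids floating-point rounding at the boundary.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return errors_found * 10_000 <= sample_size * max_errors_per_10000
```

A single mistyped price typically produces two flags, one on the wrong value and one on the value that follows it, which helps locate the entry to check against the source.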
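The coherence check on a stable field such as the date of foundation can be sketched as a comparison between consecutive yearbooks. This is a simplified illustration, not the logic embedded in the software: the function `coherence_alerts` is hypothetical, and readings are compared as the raw strings produced by the OCR.

```python
def coherence_alerts(readings):
    """Find consecutive yearbooks that disagree on a stable field.

    `readings` maps a yearbook year to the value read for one field of one
    corporation. Returns the (earlier, later) pairs whose values differ,
    so that an operator can check the corresponding images.
    """
    years = sorted(readings)
    alerts = []
    for earlier, later in zip(years, years[1:]):
        if readings[earlier] != readings[later]:
            alerts.append((earlier, later))
    return alerts
```

With three or more yearbooks, an isolated misreading raises alerts on both sides of the faulty yearbook, which points the operator to the image to inspect.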