Paris School of Economics - École d'Économie de Paris

Economics serving society

Which sources? Which organisation?

The D-FIH database

The first goal of the project is to build an exhaustive database covering all the assets traded on the Paris Bourse between 1796 and 1976. The Equipment begins in 1796 because the oldest official list of the Paris Bourse found in the archives of the Exchange dates back to that year, and it ends in 1976 because the Eurofidai database starts on January 2, 1977. The Equipment stores information on all the financial instruments, shares and bonds (public and private), issued by French and foreign issuers and traded on the Paris Bourse; it records, on a bimonthly basis, spot, forward and options prices, as well as dividends, coupons, splits and reverse splits, and other securities events relevant for computing prices harmonized over time. It also records features of the issuers, such as issuer type and nationality and, for private issuers, registered office, business sector, directors, balance sheet, etc. Lastly, it includes the foreign exchange rates and the prices of precious materials quoted at the Paris Bourse.
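To illustrate what "prices harmonized over time" can mean in practice, here is a minimal sketch of back-adjusting a price series for splits and reverse splits. It is not the D-FIH method, which this page does not describe; the names `Quote`, `Split` and `harmonize`, and the convention that `ratio` counts new shares per old share, are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Quote:
    day: date
    price: float  # price as printed on the official list


@dataclass(frozen=True)
class Split:
    day: date     # first day the security trades on its new basis
    ratio: float  # assumed convention: new shares per old share (< 1 for a reverse split)


def harmonize(quotes, splits):
    """Rescale every quote that predates a split to the latest share basis."""
    adjusted = []
    for q in quotes:
        factor = 1.0
        for s in splits:
            if q.day < s.day:
                # The quote refers to old shares: one old share equals `ratio` new ones.
                factor *= s.ratio
        adjusted.append(Quote(q.day, q.price / factor))
    return adjusted


# Hypothetical data: a 2-for-1 split taking effect on 1 February 1900.
quotes = [Quote(date(1900, 1, 15), 1000.0), Quote(date(1900, 2, 1), 500.0)]
splits = [Split(date(1900, 2, 1), 2.0)]
harmonized = harmonize(quotes, splits)
```

In this made-up example the January quote is halved, so the series no longer shows a spurious 50% drop on the split date. Dividends and coupons would need their own adjustments, which this sketch leaves out.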

The two main sources

The construction of the Equipment is based on two main printed serial sources: the official lists of the Paris Stock Exchange, which publish the information about the traded assets, and the official and private stock exchange yearbooks, which publish information about issuers. The data capture strategy depends on several variables, primarily related to the sources and to the state of the art of technology. On the source side, the crucial factors are the accessibility of the sources, the quality and layout of the printed documents, and the frequency of changes in their formats (and therefore in the type of information they contain). On the technology side, the state of the art largely determines the relative costs of the different data capture options: trade-offs must be made between established technologies characterized by a high "unit cost" and investments in innovations that push the technological frontier forward and proportionally reduce the unit cost of data.

Building a new technology

Within the framework of the Equipex D-FIH, consideration of constraints and opportunities led to setting up two data capture technologies, both of which require the scanning of the printed sources. The first is IT-organized manual entry, within an ad hoc environment, of the data published on the lists of the Paris Stock Exchange until 1950; the second is the semi-automatic processing of the stock exchange yearbooks by dedicated software based on optical character recognition (OCR) and artificial intelligence. The need to develop this dedicated software was dictated by three factors: first, the failure of the best commercial OCR software; second, the need to develop a heuristic process allowing the software to learn by doing; and finally, the opportunity to transfer the flow of data directly into the appropriate tables of the database after the semi-automatic OCR processing. The possibility of processing the official list with similar software for the post-1950 period is under study.

Taking into account direct and indirect costs, this software certainly offers gains over IT-organized manual data entry for the yearbooks. However, its use still requires significant human work before and during its deployment. First, it was necessary to create within the database an informational architecture allowing the software to disaggregate the data appropriately into the database; second, the flow of data from the OCR must be checked and validated by operators. This substantial human work required to build databases from historical sources, including tabular ones, remains a bottleneck that makes data production expensive. Research in digital humanities, as a meeting place for experts in social sciences and information technologies, is thus called upon to bring about the "Big Data Revolution" in history, that is to say the development of tools allowing large amounts of data to be produced from historical sources at a reasonable cost.
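The operator validation step described above could be organized along the following lines. This is only a sketch under stated assumptions: the page does not say how the D-FIH software presents OCR output to operators, so the `OcrField` record, its `confidence` score and the `review_queue` function are all hypothetical. Every field still reaches an operator; the sketch only orders the queue so that the least reliable readings come first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrField:
    issuer: str        # issuer the field belongs to
    name: str          # target column, e.g. "registered_office"
    text: str          # text as read by the OCR
    confidence: float  # hypothetical recognition score between 0 and 1


def review_queue(fields):
    """Order all fields for operator validation, least confident first."""
    return sorted(fields, key=lambda f: f.confidence)


# Made-up yearbook fields for illustration only.
fields = [
    OcrField("Compagnie A", "registered_office", "Paris", 0.98),
    OcrField("Compagnie A", "business_sector", "Chem1ns de fer", 0.61),
    OcrField("Compagnie B", "registered_office", "Lyon", 0.87),
]
queue = review_queue(fields)
```

Ordering rather than filtering keeps the rule stated above that the whole flow is checked, while letting operators spend their attention first where errors are most likely.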