Organising data integration

A major issue for a long-run quantitative and qualitative database of this kind is optimizing the data capture process. To achieve effective data entry, the Equipex applies two data capture technologies, both of which require scanning the printed sources. The first is IT-organized manual data entry, within an ad hoc environment, of the data published on the official lists of the Paris Stock Exchange until 1950. The second is semi-automatic processing of the stock exchange yearbooks by dedicated software based on optical character recognition (OCR) and artificial intelligence. The possibility of processing the official lists with similar software for the post-1950 period is under study; the tests run for the interwar period were unsatisfactory.
Optimizing the two technologies while remaining faithful to the historical sources first requires implementing the “informational architecture” within the database.

1) Implementation of the “informational architecture” within the database.

The informational architecture of the database is given by the time-dependent structure of the official lists, the issuers of listed securities, and the links between securities and issuers. To build it, the tree structure of the official lists’ sectors was entered first. For each sector, as it is named on the official list, the precise dates on which it appears and disappears have been recorded. This makes it possible to replicate the varying tree structure of the official list at any point in time.
Once the sectors had been created in the database, the names of all the securities, as published on the official lists, were recorded and inserted in the right sectors, taking the time dimension into account, i.e. identifying precisely the IPO and delisting dates. In this way, a precise virtual version of the official list, with all the securities classified by sector, can be reconstructed within the database for any point in time.
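The reconstruction described above can be illustrated with a minimal sketch. It assumes that each sector and each listing carries a validity interval. The class names, field names and the `list_on` function are hypothetical and do not reflect the actual D-FIH schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Sector:
    name: str                    # sector name as printed on the official list
    start: date                  # first date on which the sector appears
    end: Optional[date] = None   # last date on which it appears (None: still present)


@dataclass(frozen=True)
class Listing:
    stock_name: str              # security name as printed on the official list
    sector: str                  # name of the sector it is printed under
    start: date                  # date of admission to listing
    end: Optional[date] = None   # delisting date (None: still listed)


def active(start: date, end: Optional[date], day: date) -> bool:
    """True if `day` falls inside the closed validity interval [start, end]."""
    return start <= day and (end is None or day <= end)


def list_on(day: date, sectors: list[Sector], listings: list[Listing]) -> dict[str, list[str]]:
    """Return {sector name: [security names]} as the list stood on `day`."""
    result: dict[str, list[str]] = {
        s.name: [] for s in sectors if active(s.start, s.end, day)
    }
    for item in listings:
        if item.sector in result and active(item.start, item.end, day):
            result[item.sector].append(item.stock_name)
    return result


# Illustrative data only: sector names, security names and dates are invented.
sectors = [
    Sector("Sector A", date(1880, 1, 1), date(1900, 12, 31)),
    Sector("Sector B", date(1895, 1, 1)),
]
listings = [
    Listing("Security X", "Sector A", date(1885, 1, 1), date(1894, 12, 31)),
    Listing("Security Y", "Sector B", date(1896, 1, 1)),
]
print(list_on(date(1890, 6, 15), sectors, listings))
```

Storing intervals rather than one snapshot per day keeps the structure compact: the list for any date is derived by filtering, which is one plausible way to obtain the "virtual version" of the list mentioned in the text.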
In parallel, the names of the issuers of the recorded securities were created within the database as they were published in the yearbooks. The names of both securities and issuers were entered through dedicated data entry masks (written in Java) that automatically create the links between an issuer and the securities it issued.
The names of both securities and issuers have been recorded as they are published in the sources, in order to optimize the two data entry technologies (see below). However, the same security or issuer can take various names over time (within the D-FIH database, these names are called “stock_name” and “corporation_name” respectively). For example, the government bond “Roumanie 5%” appears as “Roumanie 5%, Oblig. de 500 fr.” until 1894 and as “Romanie 5%, Oblig de 25 fr. de rente” after 1894. On the issuer side, “Banque parisienne” changed its name to “Banque de l’union parisienne” in 1904. This is why the database has been designed to link the different names that a given security or issuer can take to a single entity (within the D-FIH database, these entities are called “stock” and “corporation” respectively). As a consequence, the database can be queried along both its “historical” and “financial” dimensions.
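As an illustration of how dated name variants might be linked to a single entity, here is a minimal sketch. Only the terms “corporation_name” and “corporation” come from the text. The `NameRecord` class, the `resolve` and `names_of` functions, the entity identifier and the exact dates are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class NameRecord:
    entity_id: int               # hypothetical identifier of the "corporation" entity
    name: str                    # "corporation_name" as printed in the source
    start: date                  # first date on which this name is used
    end: Optional[date] = None   # last date on which it is used (None: still in use)


def resolve(name: str, day: date, records: list[NameRecord]) -> Optional[int]:
    """Return the entity that carried `name` on `day`, or None if nothing matches."""
    for r in records:
        if r.name == name and r.start <= day and (r.end is None or day <= r.end):
            return r.entity_id
    return None


def names_of(entity_id: int, records: list[NameRecord]) -> list[str]:
    """All printed names of one entity, in chronological order."""
    ordered = sorted(records, key=lambda r: r.start)
    return [r.name for r in ordered if r.entity_id == entity_id]


# The name change in 1904 comes from the text; the day-level dates are assumed.
records = [
    NameRecord(1, "Banque parisienne", date(1890, 1, 1), date(1903, 12, 31)),
    NameRecord(1, "Banque de l’union parisienne", date(1904, 1, 1)),
]
print(resolve("Banque parisienne", date(1900, 6, 30), records))
print(names_of(1, records))
```

A "historical" query would go through `resolve`, using the name exactly as printed on a given date. A "financial" query would follow the entity identifier across all its successive names.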

2) Data entry

For data entry, we employed two technologies: IT-organized manual data entry on the one hand and, on the other, a semi-automatic process based on dedicated software.
IT-organized data entry covers at least the official lists from 1796 to 1950. It has been outsourced to a specialized social enterprise. To target an accuracy rate of 99.99%, a “double data entry” procedure was adopted: the same data are entered by two operators and, where their entries differ, a third person checks the data. The collaboration with this company began by adapting the logical data model to take into account the specificities of its production processes and the resulting interactions. Teams of operators have been specialized by type of data (e.g. cash prices, forward prices, dividends) in order to ensure satisfactory productivity. For each data type, the Equipex implemented simplified input masks containing only the fields that correspond to a given subset of the data published on the official lists of a given time span. Thanks to the informational structure of the database, each operator can select, from the database’s tree of the list for a given day, the security whose name exactly matches the one she finds on the official list, at the same level of the tree. The operator can then easily match the names of the securities as they appear in the database and on the official list, and enter the corresponding data read from the signposted image of the list embedded in the mask. This process minimizes the risk of misidentifying a security. Before starting data entry, the operators received specific training from members of the project on how to read the official lists. To give the operators further support in reading the official lists, an off-line website has been set up that allows exchanges between teams and stores questions and answers.
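The double data entry principle can be sketched as a comparison of two independent keyings. This is a minimal illustration and not the provider's actual workflow. The function name, the key layout (security, date, field) and the sample values are assumptions.

```python
def compare_entries(first: dict, second: dict) -> tuple[dict, list]:
    """Split two operators' keyings into agreed values and keys for a third check.

    Each dict maps a key such as (stock_name, date, field) to the value as keyed.
    A key missing from one keying counts as a disagreement.
    """
    agreed = {}
    disputed = []
    for key in sorted(first.keys() | second.keys()):
        a, b = first.get(key), second.get(key)
        if a is not None and a == b:
            agreed[key] = a
        else:
            disputed.append(key)
    return agreed, disputed


# Invented sample: operator B mis-keys one price and keys one extra line.
operator_a = {
    ("Security X", "1890-06-15", "cash_price"): "101.50",
    ("Security Y", "1890-06-15", "cash_price"): "88.00",
}
operator_b = {
    ("Security X", "1890-06-15", "cash_price"): "101.50",
    ("Security Y", "1890-06-15", "cash_price"): "83.00",
    ("Security Z", "1890-06-15", "cash_price"): "12.00",
}
agreed, disputed = compare_entries(operator_a, operator_b)
print(agreed)
print(disputed)  # keys to be sent to the third person
```

Values are compared as keyed strings here, so that formatting differences are also surfaced. A real pipeline might normalise numbers before comparing.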
Semi-automatic data entry relies on dedicated software developed in cooperation with a service provider. The software combines OCR techniques and artificial intelligence to learn from its errors and insert the data into the database automatically. It is used to capture and enter into the database all the information collected from the yearbooks. The procedure involves several steps. After the sources were scanned, lexical dictionaries were built to help the OCR software read the yearbooks, and a set of rules for the effective optical recognition of the information, its processing and its insertion into the database was written by the project members and coded by the service provider’s engineers. A system of interactions has been implemented that allows trained research assistants, supervised by experts in the field, to modify, correct and complete the information collected by the OCR technology. Because the name of the issuer stored in the database is matched with the one published in the yearbooks, the data can be imported into the database automatically and the relevant links among the data created.
The possibility of processing the official lists from 1950 to 1976 with similar software is under study. The informational architecture facilitates data entry and allows prices to be inserted for every security at any frequency. Since daily frequency is too costly for the project’s budget, the Scientific Council opted to enter prices twice a month: on the 15th and on the last day of each month. Given the budgetary constraints, this choice was based on the functioning of the exchange. For most of the period covered by the database, derivatives (forward operations and options) accounted for the bulk of transactions, and the exchange organized the delivery and settlement of these transactions twice a month, on the 15th and on the last day of the month.
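The twice-monthly sampling rule can be written down as a short sketch. It returns the calendar dates only. How the project handles a 15th or a month-end that falls on a non-trading day is not stated in the text, so no adjustment is attempted. The function name is hypothetical.

```python
import calendar
from datetime import date


def quotation_dates(year: int) -> list[date]:
    """Return the 15th and the last calendar day of each month of `year`."""
    dates = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        dates.append(date(year, month, 15))
        dates.append(date(year, month, last_day))
    return dates


for d in quotation_dates(1904)[:4]:
    print(d.isoformat())
```

This yields 24 observation dates per year for each security, compared with roughly 300 trading days under daily entry.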

3) Checking and validation of the data

The checking and validation of the data depend on the technology used for the capture. For IT-organized data entry, once the operators have completed the data entry, the Equipex checks, validates and inserts the data into the database internally. First, the Equipex IT team checks the SQL file containing the data from an IT point of view, in order to uncover any problems that would prevent the data from being inserted correctly into the database. Then, the data series are checked automatically to detect “anomalies” in the sequences. For example, this operation highlights “abnormal” returns, that is to say returns exceeding a threshold that can be set according to the historical period; the prices behind these returns are systematically checked and corrected when wrong. The last step is carried out on a sample of at least 10% of the data received. The sample is fully checked and its accuracy rate calculated. If the checked sample achieves a 99.99% accuracy rate, the data are validated and inserted into the database. If the accuracy rate is lower, the data are rejected and the specialized service provider must supply another dataset, which is checked again following the same three steps.
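The second and third steps can be illustrated with a minimal sketch. It assumes simple returns between consecutive observations and a single threshold per series. The function names, the threshold value and the sample prices are hypothetical, and the actual anomaly rules used by the Equipex may differ.

```python
def abnormal_returns(prices: list[float], threshold: float) -> list[int]:
    """Return the indices i where |prices[i] / prices[i-1] - 1| exceeds `threshold`."""
    flagged = []
    for i in range(1, len(prices)):
        previous, current = prices[i - 1], prices[i]
        # Skip non-positive previous prices: the return is undefined there.
        if previous > 0 and abs(current / previous - 1.0) > threshold:
            flagged.append(i)
    return flagged


def sample_accepted(checked: int, errors: int, max_errors_per_10000: int = 1) -> bool:
    """True if the checked sample reaches the target accuracy (99.99% by default).

    Integer arithmetic avoids rounding issues at the acceptance boundary.
    """
    if checked <= 0:
        raise ValueError("the sample must contain at least one checked value")
    return errors * 10_000 <= checked * max_errors_per_10000


# Invented series: 1010.0 looks like a keying error for 101.0.
series = [100.0, 101.0, 1010.0, 102.0]
print(abnormal_returns(series, threshold=0.5))
print(sample_accepted(checked=20_000, errors=2))
print(sample_accepted(checked=20_000, errors=3))
```

A single mis-keyed price typically produces two consecutive flagged returns, one into the wrong value and one out of it. This is why the flagged prices, and not only the returns, are the ones inspected.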
The data captured by the dedicated software undergo automatic coherence checks that are embedded in the software and exploit the redundancy of the information. Each year, the yearbooks publish a notice on a given listed issuer. The notices on this issuer published in successive yearbooks change over time, but a significant set of information remains stable, and the rest is republished unchanged in at least three or four successive yearbooks. For example, the date of foundation of a corporation is published in every yearbook and must obviously be the same each time. If the software reads “20/09/1905” as the date of foundation of a given corporation in yearbook N but “20/09/1903” in yearbook N+1, it sends an alert. An operator then checks the images of the yearbooks, which are embedded within the software, and corrects the value. Beyond these automatic coherence checks, additional manual checks are run on a sample of at least 5% of each category of data captured by the software for each yearbook.
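The redundancy check on stable fields can be sketched as follows. This is a minimal illustration and not the logic of the actual software. The function name, the tuple layout, the field name and the yearbook years are assumptions. Only the two date strings come from the example in the text.

```python
from collections import defaultdict


def coherence_alerts(readings: list[tuple[int, str, str, str]]) -> list[tuple[str, str, dict]]:
    """Flag stable fields read with different values in different yearbooks.

    Each reading is (yearbook_year, corporation, field, value_as_read).
    Returns one alert (corporation, field, {yearbook_year: value}) per conflict.
    """
    seen = defaultdict(dict)  # (corporation, field) -> {yearbook_year: value}
    for year, corporation, field, value in readings:
        seen[(corporation, field)][year] = value.strip()
    alerts = []
    for (corporation, field), by_year in sorted(seen.items()):
        if len(set(by_year.values())) > 1:
            alerts.append((corporation, field, dict(sorted(by_year.items()))))
    return alerts


# Yearbook years and the corporation name are invented for the example.
readings = [
    (1910, "Corporation A", "foundation_date", " 20/09/1905"),
    (1911, "Corporation A", "foundation_date", "20/09/1903"),
    (1910, "Corporation B", "foundation_date", "01/03/1881"),
    (1911, "Corporation B", "foundation_date", "01/03/1881"),
]
for alert in coherence_alerts(readings):
    print(alert)
```

The check cannot tell which reading is wrong, only that the readings disagree, so each alert still has to be resolved by an operator against the scanned images.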
