Hard Numbers: Open Consumer Price Database

We document a new source of consumer price microdata. The new database allows researchers studying consumer price behaviour to access current and granular raw statistical observations. The range of observed prices fully covers the goods and services of Rosstat's CPI sample and extends beyond it. In this paper, we pursue two objectives. First, we describe the data collection mechanism, the data structure, and the access protocols, as well as provide four complete illustrations of their application using the open API: i) training machine learning models for product classification based on text labels, ii) real-time tracking of product prices, iii) estimating hedonic regressions for product groups, and iv) calculating arbitrary analytical price indices. Second, we share a set of basic skills and technologies for the benefit of researchers interested in creating their own sources of alternative data.


Introduction: alternative sources of price data
Traditionally, the researcher's contribution has been the analysis, while the data inputs are usually provided by national statistical institutes, international organisations, or dedicated statistical data providers, such as financial market trading platforms.
Today, as digital services penetrate the economic life of households, many business activities leave an increasingly large digital footprint. This growing digital footprint opens up new opportunities for researchers, in particular, for monitoring the levels of activity and prices, discovering new patterns, and refining existing ideas.
Today, the efforts to turn new data sources into sets of interpretable indicators are combined within the research area of alternative data viewed in the broadest sense of the term (this issue was discussed, in particular, at the Banca d'Italia and Federal Reserve Board's joint conference 'Non-Traditional Data and Statistical Learning with Applications to Macroeconomics' held in November 2020).
In a narrower sense, from the point of view of national statistical institutes, 'alternative data' are any observations obtained outside the conventional system of statistical reports and surveys (see Konny et al., 2019). This paper's subject is consumer price monitoring. In this area, two main types of data are commonly referred to as alternative data: 1) scanner data and 2) price data posted on the websites of retailers and other companies (web-scraped data).
The consumer price data source that we have created and documented in this paper relies on web scraping and meets both the broad and narrow definitions of alternative data. These data are supplementary to official statistics. Researchers can use them independently, yet they have the potential to enrich the primary data sets of statistical offices.
The paper is structured as follows. In Section 2, we briefly compare our web scraping technology and scanner data with traditional price surveys. The comparison is made from the perspective of both the national statistical institute and the user of statistics. We also describe our own experience in arranging the data collection and share advice that may be useful to researchers who are contemplating the establishment of their own indicators.
In Section 3, we define the dataset in terms of the available information, the data collection perimeter (both the scope of items tracked and geography), frequency, data structure, etc. Beyond that, we describe the access mode and protocol for researchers and potential participants in the price monitoring project, as the data are open to interested researchers, with access provided via the API (application programming interface) described below.
In Section 4, we present four examples of the application of our dataset: a) development of a machine learning product categorisation algorithm based on text labels; b) real-time price level tracking for individual goods or categories of goods, for example, during periods of changes in regulatory policy or significant supply shocks; c) design of a prototype hedonic regression for a class of goods with a short technological cycle; d) calculation of arbitrary price indices, such as a food price index. These examples are illustrative. They are intended to show real-world data handling capabilities; the full data package to replicate these examples is attached to this paper.
In Section 5, we summarise and set out future areas of work on 'hard numbers' as a new open source of high-frequency microdata on consumer prices.

Technologies for collecting price statistics
Statistical agencies are actively developing tools to collect and analyse web-scraping-based data on consumer prices. This effort is driven by the need to increase the efficiency of price collection procedures (see da Silva et al., 2019): following new retail formats and improving the representativity of collected data, developing the methodology for calculating indices, refining the weight structure more quickly, expanding the observed sets of goods and services, and increasing the speed of calculations.
For example, Eurostat has published guidelines on data collection from websites/web scraping (EuroStat, 2020) and scanner data (EuroStat, 2017); the experience of the US Bureau of Labor Statistics (BLS), responsible for publishing consumer price indices, and its plans for working with alternative data are summarised in Konny et al. (2019); in the UK, systematic work is underway on incorporating alternative data into the calculation of official indices (Bhardwaj et al., 2017). Much of the work presented at the last meeting of the UN Ottawa Group on Price Indices also focused on new data sources (Ottawa Group, 2019).
The basic international guide to consumer price statistics (ILO et al., 2004) distinguished only three methods for collecting primary data on price levels:
- centralised data collection, for example, when data on the levels of regulated prices and tariffs are accumulated in the central office of a national statistical institute based on reports from relevant organisations and government agencies;
- local observation of prices, when employees of the statistical institute visit offices and points of sale to collect price data;
- scanner data, such as the retailers' datasets, described by the manual as an increasingly important source of information which is, however, so novel that there are no universal guidelines or recommendations on how to use it.
The updated manual (ILO et al., 2020) expands this list with a number of new sources. Two are the most important: first, datasets of prices that trading companies send to statistical agencies or that agencies acquire from specialised aggregators (for example, Konny et al., 2019, discuss the BLS experience in purchasing data on healthcare services and food prices); second, public data on price levels derived from web scraping.
Below, we compare these alternative price monitoring technologies from the point of view of a statistical institute and from the data end-user's perspective. To keep the discussion simple, we combine scanner data with the data reported by retail chains to statistical agencies.

Comparing alternative price observation technologies
When selecting a price monitoring technology, statistical agencies have to consider a number of factors which, generally, can be reduced to finding the right balance between the completeness of the information and the costs of monitoring both for the statistical institute and the reporting companies.
From the data user's point of view, the costs of monitoring are unlikely to be a significant factor in selecting the price monitoring technology, as such users generally do not incur these costs; for them, the decisive factors are the completeness of data and their availability.

Source: compiled by the authors
Table 1 provides a general comparison of price monitoring technologies, which are discussed in more detail below (based on Cavallo and Rigobon, 2016, and Konny et al., 2019).
The completeness of observing price levels for calculating the relevant indices can be described by two indicators: the observability of price-associated sales volumes and the realistically achievable perimeter of observed prices.
Traditionally, sales volume data remain unavailable and, in all cases known to us, price changes at the level of primary observations are aggregated using the geometric mean of growth rates, without considering volumes. In this sense, statistical institutes hope that alternative data will enable them to consider the structure of spending on goods when aggregating price increases at the level of primary observations. Notably, in the absence of such weights based on sales proportions, the UK Office for National Statistics proposed an experimental weighting scheme based on price distributions (Sands, 2020b).
The second indicator of data completeness is the price observation perimeter, for which there is a natural ordering. Centralised observation and point-of-sale surveys have the highest costs and usually correspond to the narrowest observation perimeter, as only about three quotations per representative product can be recorded at one point of sale. Streaming data provides potentially wider coverage of the product range, but only for items actually sold in the relevant period. Web scraping, on the other hand, is very close to a retail outlet survey in the sense that all offered prices are collected, but at a significantly lower cost.
The definition of correct price collection may look different depending on the discount accounting practices adopted by the statistical institute. For example, the Consumer Price Index Manual (ILO et al., 2004) states that 'it is usual to reflect discounts and rebates only if unconditional, whereas loyalty schemes, money-off coupons, and other incentives are ignored'. It should be noted that scanner data reflect prices actually paid, taking into account such loyalty programmes and other conditional incentive schemes, while web scraping ignores such prices. However, statistical agencies view the new data as having the potential to capture the full range of retail prices: regular, promotions available to everybody, promotions available to members of loyalty programmes, and conditional ones (see Boettcher, 2015).
One of the most interesting areas for the application of alternative data is the classification of observations. With a small primary sample size, statistical institutes usually do not face the challenge of classifying primary information: the scope of primary price collection is predefined by directives of the local or central statistical office. Handling big data, by contrast, requires tools to identify relevant observations and to classify primary data according to the classifiers used by the statistical office.
This is an area of active research, where it is too early to draw definitive conclusions about the achievable limits of accuracy in classifying observations from various sources. However, the experience of Konny et al. (2019) suggests that data streamed from different retail chains and points of sale are often characterised by idiosyncratic classification approaches, and it may be challenging to bring them all to a standard classification.
In this sense, web scraping data can potentially retain more details about the observed good: its textual name, detailed description, point-of-sale characteristics, and, in some cases, an image. Moreover, this approach allows the product range to be refined simply by visiting the website with the published prices.
A major difference between the approaches is the openness of primary data. Primary data accumulated by statistical institutes may not be disclosed and, under the legislation of the Russian Federation (Federal Law No. 54-FZ of 22 May 2003), neither may scanner data. By contrast, web scraping collects data from open sources. It should be noted that the use of web scraping is not directly regulated by law, and there is still no established legal practice in this regard (see Rossiyskaya Gazeta, 20 September 2019).
To sum up, we would like to emphasise that, whatever the differences in the completeness of data and the cost of observation from the viewpoint of statistical institutes, for researchers engaged in the creation of statistical indicators outside major retail chains or statistical agencies, web scraping is the technology that appears to be the most realistic in terms of implementation and support.
In the next section, we will share our experience in implementing web scraping and the set of technologies that we tested.

Web scraping technologies
Cavallo and Rigobon (2016) point out the use of programming languages such as Python and PHP for this purpose; later, specialised 'point-and-click' software solutions that require almost no technical expertise became available. Boettcher (2015) describes Statistics Austria's experience with the import.io service, which allows data collection without programming. According to Polidoro et al. (2015), the Italian National Institute of Statistics relies on a web scraping toolset that includes HTQL, IRobotSoft, and iMacros. The Brazilian Institute of Geography and Statistics used the R Suite to develop their solution (da Silva et al., 2019). The choice of data collection technology is largely dictated by the skill set of researchers.
Developing a data source usually requires addressing two different kinds of problems: selecting a technology for collecting observations and choosing a technology for disseminating the results, that is, for providing users' access to the data. Note that technology should be understood not only as the choice of programming language or dedicated software but also as the set of hardware to run these procedures on.
It took three stages to build the physical infrastructure for our project. At the initial stage, the data were collected on a single personal computer. In our experience, with careful scraping scheduling, such a mechanism can remain computationally stable when the number of processed retail chains is about 10. At the second stage of the project's development, we scaled up data collection by creating a small cluster of Raspberry Pi microcomputers. The advantage of such a cluster is that it allows the data collection process to be parallelised and is relatively easy to scale by creating a universal system image that can be installed on a growing number of computing units. This approach is limited by the need to maintain network infrastructure and uninterrupted network access. At the third stage, we moved the data collection capacity to a cloud infrastructure: this approach allows more flexible scaling of computing capacity and also relieves the researcher of a number of challenges associated with maintaining network access.
As for the data collection software, we opted for a Python-based architecture. This programming language has a vibrant developer community and a large ecosystem of tools for addressing web scraping tasks.
A note should be made with regard to the time costs of developing the web scrapers. In our experience, after acquiring the basic knowledge and skills, a motivated researcher can write the code for about 2-3 web scrapers a day for relatively complex websites. In Appendix 1, we give an example of a basic web scraper written in Python for a website. It contains the complete web scraper code, which can collect about 16-20 thousand price quotes within a 30-minute website visit with no additional parallelisation of the data collection process.
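The Appendix 1 scraper is not reproduced here, but its core idea, walking a catalogue page and extracting (name, price) pairs, can be sketched with the Python standard library alone. The HTML layout, class names, and product data below are invented for illustration and do not correspond to any real retailer's markup; a real scraper would first download the page, for example with urllib.request.

```python
from html.parser import HTMLParser

# Illustrative page fragment; in practice this string would be fetched
# from a retailer's catalogue page over the network.
SAMPLE_HTML = """
<div class="product"><span class="name">Milk 3.2% 1 l</span>
  <span class="price">79.99</span></div>
<div class="product"><span class="name">Rye bread 400 g</span>
  <span class="price">45.50</span></div>
"""

class PriceParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name">/<span class="price">."""
    def __init__(self):
        super().__init__()
        self._field = None    # which field we are currently inside, if any
        self._current = {}    # partially assembled record
        self.records = []     # completed (name, price) observations

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "name" in self._current and "price" in self._current:
                self.records.append(
                    (self._current.pop("name"), float(self._current.pop("price")))
                )

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.records)  # [('Milk 3.2% 1 l', 79.99), ('Rye bread 400 g', 45.5)]
```

A production scraper adds the pieces this sketch omits: request scheduling, pagination, retries, and writing the records to the database.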
The second class of tasks involves data storage and dissemination (we discuss the data structure in more detail in Section 3). At this stage, it is important that researchers understand whether their team will work within one business's perimeter or whether they should consider the need to cooperate across the network boundaries of several businesses.
In our project, we assume that such a research task will require long-term interaction across the information system boundaries of different businesses. To systematise these interactions, namely, to allow adding data and getting access to the collected datasets, we use a cloud-based PostgreSQL database management system with an application programming interface (API); for a detailed description of using the API, see Sections 3 and 4. In addition to this globally accessible system, there is also an internal database on Microsoft SQL Server that provides back-up storage for the accumulated data. The external PostgreSQL database and the internal Microsoft SQL Server database are regularly synchronised.

Source: authors' estimates
The technologies required for collecting alternative data are listed in Table 2. In our view, this toolset is close to the minimum necessary to build a sustainable distributed system of data collection.

Dataset description and access protocol
The choice of the data range to be monitored is subject to the usual constraints that require a balance between the perimeter and the cost of price observation. In principle, it is possible to record quite a large body of information that potentially includes descriptive characteristics of goods, their rating by consumers, and so on. At the moment, we limit ourselves to recording price levels; the complete set of available data is discussed below. Overall, it is close to what is suggested by the updated Consumer Price Index Manual (ILO et al., 2020).

Dataset description
The Consumer Price Index Manual (ILO et al., 2020) includes a detailed discussion of web scraping and a recommended data structure with the following fields: 1) price recording date; 2) retail chain name; 3) item category according to the retailer's classification; 4) unique text name of the good; 5) price of the good. The first data in our dataset date back to January 2019. These data largely refer to the food part of the consumer basket. A qualitative increase in coverage began in June 2020, when the number of covered goods rose first to 10,000 and later to several million items.
Currently, our dataset covers businesses operating in the Moscow metropolitan area. We are making efforts to cover all major retailers, fast food chains, and businesses providing services included in the consumer basket.
The geographic limitations of the dataset matter to different degrees depending on the category of goods and services. For a significant share of goods, the variable part of the price is the cost and time of delivery, while the price of the good itself remains the same across Russia's regions. Food chains and services are a natural exception, where the regional aspect is more relevant.
We should distinguish two qualitatively different sets of data:
- primary price statistics;
- categorisation and classification of primary price statistics.
Primary price statistics capture price levels for identifiable goods and services.
Categorisation and classification are the basic elements for the analytical aggregation of the obtained primary observations into groups of goods. This is a fairly broad task, which includes both the basic identification of unique goods distributed by various retail chains and a higher-level aggregation of representative goods into groupings based on the classifier of the statistical institute.
At this stage, we consider the collection of primary statistics data as the main task, while the classification is a secondary analytical task.
It should be noted that statistical institutes are developing algorithms for the classification of text and other metadata to automate the processing of statistical data. These algorithms are not always complex: Harms and Spinder (2019) show that logistic regression provides the most accurate and efficient algorithm. The methods used for classification by Statistics Norway are described in Myklatun (2019); the experience of the Netherlands is discussed by Griffioen and Ten Bosch (2016); Barcaroli et al. (2015) review the experience of the Italian National Institute of Statistics; and the experience of Luxembourg is described by Guerreiro et al. (2018).
The scale of data aggregation varies significantly depending on the decisions of statistical institutes. For example, Griffioen and Ten Bosch (2016) note that Statistics Netherlands monitors ten websites to collect several thousand price quotes.
The main observation targets within our designed dataset are individual goods included in the catalogues of individual retail chains.
The characteristics of goods can generally be divided into two types:
- permanent: name, description, physical characteristics;
- variable, such as the price.
The database we maintain comprises the following permanent data:
- a URL (reference to a web page) of the good or of a price list;
- a set of key dates: start of observation, last observation;
- text name of the good;
- Rosstat classification code for part of the observations;
- attributes of the data source for the good: attributes of the party that entered the observation into the database.
The price data record two prices:
- the price at which a transaction can be made;
- the previous price before a discount or price increase, if any.
Today, the total number of our unique observations is 4.5 million. It has increased significantly since we intensified work with the dataset at the end of the first half of 2020. Currently, there are 150,000 text names of goods. The number of individual scrapers is about 90.
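For illustration, the record structure above can be rendered as a single Python dataclass. The field names and sample values are our own illustrative rendering, not the actual column names of the database; the URL is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PriceObservation:
    """One price record, mirroring the permanent and variable attributes above."""
    url: str                            # web page of the good or of a price list
    name: str                           # text name of the good
    first_seen: date                    # start of observation
    last_seen: date                     # last observation
    price: float                        # price at which a transaction can be made
    old_price: Optional[float] = None   # previous price before a discount, if any
    rosstat_code: Optional[str] = None  # classifier code, for part of observations
    source: str = ""                    # attributes of the party that added the record

obs = PriceObservation(
    url="https://example-retailer.ru/milk-1l",  # hypothetical URL
    name="Milk 3.2% 1 l",
    first_seen=date(2020, 6, 1),
    last_seen=date(2020, 9, 1),
    price=79.99,
    old_price=85.0,
)
print(obs.price, obs.rosstat_code)  # 79.99 None
```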
About 60,000 items were marked manually according to the Rosstat classifier. This work followed the detailed classification requirements described in the Rosstat reference guide. Nevertheless, it should be noted that the task of classifying price observations into Rosstat categories or any other classifier is, strictly speaking, a separate analytical task for which we seek to establish the necessary research infrastructure, but which we do not claim to have solved or to be addressing as a matter of priority. Our current goal is to create a dataset that gives researchers unprecedented access to current and granular consumer price statistics.

Data availability and principles of working with data
Our data are open to researchers. In order to structure access and streamline the load on the data provision capacity, we identify two types of access:
- access with the right to add data, for users who independently add price observations or observation classification records;
- read-only access, for users who use the collected data.
Data can be accessed via the API. This allows obtaining data using any method known to the researcher, including popular statistical software that supports collecting data via an API (Python, R, MATLAB, Excel, etc.). To gain access, submit a request on the project's website or write to the authors of this paper.
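As an illustration of read-only access, the sketch below turns a JSON payload of the kind such an API might return into a sorted time series. The payload layout is an assumption for illustration only; the authoritative request and response formats are described in the project's API documentation.

```python
import json

# Canned response standing in for an API call; a real client would fetch it,
# e.g. with urllib.request, from the project's (here unspecified) endpoint.
RESPONSE = json.loads("""
[
  {"date": "2021-01-12", "name": "Milk 3.2% 1 l", "price": 81.50},
  {"date": "2021-01-11", "name": "Milk 3.2% 1 l", "price": 79.99}
]
""")

def to_series(rows):
    """Convert API rows into a chronologically sorted list of (date, price)."""
    return sorted((r["date"], r["price"]) for r in rows)

series = to_series(RESPONSE)
print(series)  # [('2021-01-11', 79.99), ('2021-01-12', 81.5)]
```

The same pattern (request, JSON decode, reshape) applies from R, MATLAB, or any other tool with HTTP and JSON support.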
Currently, about 20 researchers have access to the data. We do not limit the intensity of access, and this usage attests to the viability of the developed platform.
To demonstrate the simplicity of working with the data, we have included sample code for obtaining data on a consumer price index item in Appendix 2.
We have published a detailed description of the API on Google Colab and are ready to answer questions in the repository created for the project on GitHub. In the next section, we give several micro examples of research tasks that can be addressed with the primary data.

Machine learning and classification of goods
Expert classification of monitored items increases the cost of this approach and, in the general case, is simply not feasible. International experience (see Barcaroli et al., 2015; Griffioen and Ten Bosch, 2016; Guerreiro et al., 2018; Myklatun, 2019) suggests that updates to the product range (including new packaging varieties) can result in thousands of new items every month.
In this case, tools that automatically recognise the categories of goods can be especially handy. There is a rich and diverse family of classification models (see Hastie et al., 2009). However, the experience of researchers suggests that even relatively simple models (such as logistic regression) can achieve acceptable classification results (see Sands, 2020a for a comparative analysis based on UK data).
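To make the point concrete, the sketch below trains a tiny logistic regression on bag-of-words features in pure Python, separating two categories of text labels. The labels, categories, and training set are invented toy examples, not drawn from our database or from Rosstat's classifier.

```python
import math

# Toy training set: (text label, category), 1 = milk products, 0 = bread.
TRAIN = [
    ("milk 3.2% 1 l", 1), ("baked milk 0.5 l", 1), ("milk drink 1 l", 1),
    ("rye bread 400 g", 0), ("white bread sliced", 0), ("bread loaf 350 g", 0),
]
VOCAB = sorted({tok for text, _ in TRAIN for tok in text.split()})

def featurize(text):
    """Binary bag-of-words vector over the training vocabulary."""
    toks = set(text.split())
    return [1.0 if v in toks else 0.0 for v in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the log-loss.
w = [0.0] * len(VOCAB)
b = 0.0
for _ in range(200):
    for text, y in TRAIN:
        x = featurize(text)
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log-loss with respect to the logit
        w = [wi - 0.5 * g * xi for wi, xi in zip(w, x)]
        b -= 0.5 * g

def predict(text):
    """Probability that a label belongs to the 'milk products' category."""
    x = featurize(text)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(predict("milk 1 l") > 0.5)     # True  -> milk products
print(predict("bread 500 g") > 0.5)  # False -> bread
```

In practice the same model is fitted with a library (e.g. scikit-learn) on tens of thousands of labels and the full set of Rosstat categories, but the mechanics are identical.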

Hedonic regressions
A popular type of product that requires adjustment for changes in quality is the smartphone: its life cycle is short, and preferences and technology evolve more rapidly than, for example, food production technology.
As a result, recording the same price for a mobile phone model over a number of years can lead to bias in estimates of inflation indices (for a detailed discussion, see Aizcorbe et al., 2020). The discussion by Silver and Heravi (2005) provides a central bank perspective on such recording of prices. It is naturally advisable to account for differences in the quality of the representative goods whose prices are monitored, but this requires a sizeable sample of goods of the same class in order to identify the value of their different characteristics to the consumer. Da Silva et al. (2019) describe the needs of statistical services in Brazil for web scraping data and provide an example of constructing a hedonic regression based on them.
Our data allow recording prices for a wide range of available varieties of goods and enriching such datasets with their physical and other characteristics in order to estimate hedonic regressions. In Appendix 5 and on Google Colab, we demonstrate a simple example of using the data to build a hedonic regression for a mid-priced mobile phone, an item whose price is monitored by Rosstat.
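The Appendix 5 example is not reproduced here, but the mechanics of a hedonic regression can be sketched in a few lines: regress log price on product characteristics and read the coefficients as implicit valuations of those characteristics. The phone configurations and prices below are invented toy data, not observations from our database.

```python
import math

# Invented toy data: (RAM GB, storage GB, price in roubles).
PHONES = [
    (3, 32, 9_000), (4, 64, 13_000), (4, 128, 15_000),
    (6, 128, 20_000), (8, 256, 30_000),
]

# Design matrix with an intercept column; response is log price.
X = [[1.0, ram, mem] for ram, mem, _ in PHONES]
y = [math.log(p) for _, _, p in PHONES]

def solve(A, rhs):
    """Gauss-Jordan elimination for the small normal-equation system."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda k: abs(M[k][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for k in range(n):
            if k != i:
                f = M[k][i] / M[i][i]
                M[k] = [a - f * c for a, c in zip(M[k], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Ordinary least squares via the normal equations: (X'X) beta = X'y.
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(XtX, Xty)
print([round(c, 3) for c in beta])  # intercept, RAM and storage coefficients
```

On this toy sample, both characteristic coefficients come out positive: each additional unit of RAM or storage is associated with a higher (log) price, which is exactly the implicit-price reading a hedonic model is after.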

Building price indices
Researchers can use the raw data not only to monitor individual items but also to build any analytical index. For example, Figure 1 shows a food price index prepared using a methodology closely replicating the one proposed by Cavallo (2013). For this purpose, price increases for each monitored good are calculated within each category corresponding to the classes of representative goods in the Rosstat classifier; they are aggregated using the geometric mean formula. Next, the values obtained for each category are aggregated into the final index using the arithmetic mean with the official weights provided by Rosstat for each category. As shown in Figure 1, the index thus derived from online retailer data demonstrates a trend similar to that of the official Rosstat consumer price index for food products.
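The two-stage aggregation just described can be sketched as follows. The price pairs and category weights are made-up illustrative values, not Rosstat's actual weights or our observations.

```python
import math

# Category -> list of (old_price, new_price) pairs for monitored goods.
PRICES = {
    "milk": [(80.0, 84.0), (75.0, 76.5)],
    "bread": [(45.0, 45.0), (50.0, 52.0)],
}
WEIGHTS = {"milk": 0.6, "bread": 0.4}  # hypothetical expenditure weights, sum to 1

def jevons(pairs):
    """Unweighted geometric mean of price relatives within one category."""
    return math.exp(sum(math.log(new / old) for old, new in pairs) / len(pairs))

def index(prices, weights):
    """Weighted arithmetic mean of category relatives (the second stage)."""
    return sum(weights[c] * jevons(pairs) for c, pairs in prices.items())

print(round(index(PRICES, WEIGHTS), 4))  # -> 1.0289, i.e. +2.89% over the period
```

Because the first stage is unweighted within categories, only the category-level Rosstat weights are needed, which is what makes the scheme feasible without sales volume data.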
It should be noted that differences in price trends between Rosstat's data and ours are natural and objective, and can be explained by the fundamentally different geography of price collection (our data relate to the Moscow region, while Rosstat collects data across Russia), the types of surveyed points of sale (we monitor mostly major retail chains, while Rosstat tracks all types of outlets), and the samples of goods (we use strict descriptions of representative goods, while Rosstat does not disclose specific items).
It should be noted that, potentially, the difference between the official indices and the indices built on our data could be informative and may be used when studying such issues as the impact of e-commerce development on price behaviour at traditional points of sale, the divergence of price trends across Russian regions, etc.

Conclusion
As digital technology increasingly penetrates the economic life of society, new opportunities open up for researchers, in particular for building new statistical indicators and improving existing ones.
In this paper, we addressed two tasks. First, we documented a new source of microdata on consumer prices that enables researchers to access unprecedentedly granular and current data on a wide range of consumer goods and services. In addition, we provided micro examples of using the data to address typical challenges in the classification of product items, the monitoring of price trends, and the market assessment of qualitative characteristics of goods. Second, we summarised our own and international experience in developing and maintaining such sources of alternative data, with detailed recommendations on the structure of research teams and the technical skills necessary to scale such projects.
We are planning to develop this dataset by extending the geographical coverage, classifying items, and building product matches between retail chains.

Figure 1. Price index based on the Cavallo (2013) methodology compared with the consumer food price index according to Rosstat

Table 1. Comparative characteristics of price monitoring technologies

Table 2. Technologies for alternative data collection