Marcin Wolski

September 2020

Support your COVID-19 projects with easier data access

(This post was also published on medium.com)

The COVID-19 pandemic has profoundly shaken the world. No other event in recent history has led to such a revolution in our lifestyle, nor forced countries to halt their economic activity by implementing nationwide lockdowns. As the crisis unfolded rapidly, standard economic indicators were initially of little use. Alternative sources can offer a helping hand, but they often require digging through piles of non-standardized data. This can be a huge problem if you want to quickly form a view on the topic. In this blog post we show you how to get some of the alternative-data indicators into just the right shape to support your COVID-19 project.

An example: COVID-19 and electricity

In general, GDP and economic aggregates are easi...

Big Data

February 2020

Working with satellite data - part 1

I have already emphasized that Big Data changes the way data analysts think and work. Even hedge funds and stock traders adjust their positions in reaction to information delivered through alternative data channels. For instance, flight information showing that a jet belonging to Occidental Petroleum had landed at an airport in Omaha helped traders anticipate that the company was about to receive an extra capital injection of USD 10bn from Warren Buffett (read FT article). While many alternative data sets are premium, a fairly large number are available to the public. For quite some time I have been curious to explore them in detail and, as it turned out, a small proof-of-concept project offered an opportunity to do so.

The goal was to check if the satell...

Stata programming

October 2019

Dealing with long queries in Stata

Stata has several built-in limits in its engine. They mostly support efficient memory allocation and make commands run faster overall. For the majority of applications the limits are large enough that the user will not even notice them (the detailed list of limits can be found here). It is, however, the one-in-a-million applications that can make Stata routines quite cumbersome.

The limit which I discovered recently concerns the maximum macro length (or the maximum command length, it is difficult to judge). Even in Stata MP, a macro can hold at most 4,227,143 characters in Stata 14. In Stata 15 the limit is nearly four times larger, but as that did not fix the problem, I suspect that it is related to the command length rather than the macro length.

I had to use an SQL query to select the records with certain identifiers. The average size of the identifier was...

Big Data

September 2019

Demystifying Big Data

The topic of big data has recently been in the spotlight, coming to the attention of not only data scientists and researchers but even journalists and Internet enthusiasts. As a buzzword it has penetrated the environment to such an extent that nearly everybody I speak with asks if my methodological papers can be applied to big data. And honestly, after browsing through countless resources and talking with multiple experts, I am finally able to conclude: yes. But coming to this conclusion required putting together some big data concepts, which I could not find written down explicitly elsewhere. I therefore present my personal point of view below.

For me, the topic of big data was difficult to comprehend due to its terminology, which comes from multiple disciplines. Starting with the term ‘big data’, what distinguishes the fields is not the word ‘data’, but rather the word ‘big’. Of course, different sciences work with di...

R programming

April 2019

The fastest density estimation in R

There are many packages dealing with density estimation in R. They offer several advantages over manually coded algorithms, including bandwidth-selection procedures and more complex features of density estimation, like derivative estimation or higher-order kernels. Some of them are also coded natively in C, which should speed up the calculations and improve memory management. Nevertheless, many of these extra features often go unused in simple applications of density estimation. That leaves the question open: which algorithm is the fastest?

I try to look at the broadest possible set of R packages dealing with density estimation, including np, kdensity, sm and GenKern. There are some other packages which I skip here, as I wanted to make sure I estimate the density at a given point, and not across the domain, to make the numbers comparable. (As a s...

R programming

March 2019

The R way to explore APIs

Data access is often a nightmare, especially with irregular data shapes or multiple data types. APIs, or application programming interfaces, offer simple access gates to information resources in their native structures, and are therefore a powerful tool to quickly boost many research projects.

In a nutshell, an API is a gate through which a user may access the resources or data located on a server in a quick and friendly way. APIs have a generic address, typically in the form of an HTTP address, and endpoints. Endpoints direct the user to the specific parts of the database (like tables) that the user may need to access. APIs typically require an authentication key, called a token, which gives the server an access control mechanism. Sometimes you need to pay for a token, but oftentimes some limited functionality is offered for free.
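The three ingredients described above (a base address, an endpoint and a token) can be sketched as follows. This is a minimal illustration in Python rather than R, and everything in it is an assumption made for the example: the address `https://api.example.com/v1/`, the `prices` endpoint, the `symbol` parameter and the convention of passing the token as a query parameter are all hypothetical. Real APIs differ, and many expect the token in a request header instead.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base address of an API; not a real service.
BASE_URL = "https://api.example.com/v1/"


def build_request_url(base_url, endpoint, token, **params):
    """Combine base address, endpoint, query parameters and token into one URL."""
    # The endpoint selects the part of the database we want (e.g. a table).
    address = urljoin(base_url, endpoint)
    # The token is appended as a query parameter (an assumption; see lead-in).
    query = urlencode({**params, "token": token})
    return f"{address}?{query}"


# "prices", "symbol" and "MY_TOKEN" are placeholders for illustration only.
url = build_request_url(BASE_URL, "prices", "MY_TOKEN", symbol="ABC")
print(url)
```

The resulting string is what an HTTP client would then request; the server checks the token before returning the data behind the endpoint.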

To demonstrate the performance of the API, I will access the trading database of

R programming

February 2019

Testing conditional expectations

I recently came across the problem of testing whether the expectations of one variable, call it $Y$, vary along the distribution of another variable, say $X$. The problem can be approached from several angles, including a parametric quantile approach; however, I decided to use one of the most flexible methods, and actually one of my favorites: the bootstrap.

The idea is quite simple. Imagine two random variables $Y$ and $X$. (For the exact definition of a random variable, the Wikipedia page has a lot of useful information.) Given their observed realisations $\{(Y_i,X_i):i=1,...,n\}$, the goal is to test if the conditional average of $Y$ is statistically different from its unconditional average. We can approximate the former by estimating the mean of $Y$ for different parts of the distribution of $X$. For instance, we can test if expectations of $Y$ d...
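One way to make the idea concrete is sketched below in Python. It is only an illustration under stated assumptions, not necessarily the procedure used in the post: the "part of the distribution of $X$" is taken to be the lower half (observations with $X$ at or below its median), the pairs $(Y_i, X_i)$ are resampled with replacement, and a simple 95% percentile interval is reported for the gap between the conditional and unconditional means. The function name `bootstrap_mean_gap` and the simulated data are hypothetical.

```python
import random
import statistics


def bootstrap_mean_gap(y, x, n_boot=1000, seed=0):
    """Percentile-bootstrap the gap mean(Y | X <= median(X)) - mean(Y)."""
    rng = random.Random(seed)
    pairs = list(zip(y, x))

    def gap(sample):
        # Conditional mean of Y over the lower half of the X distribution
        # minus the unconditional mean of Y.
        x_median = statistics.median([xi for _, xi in sample])
        conditional = [yi for yi, xi in sample if xi <= x_median]
        return statistics.mean(conditional) - statistics.mean([yi for yi, _ in sample])

    observed = gap(pairs)
    # Resample (Y_i, X_i) pairs with replacement and recompute the gap.
    draws = sorted(gap(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot))
    lower = draws[int(0.025 * n_boot)]
    upper = draws[int(0.975 * n_boot) - 1]
    return observed, (lower, upper)


# Simulated data (an assumption for illustration): Y increases with X.
data_rng = random.Random(42)
x = [data_rng.random() for _ in range(200)]
y = [2 * xi + data_rng.gauss(0, 0.5) for xi in x]

observed, (lower, upper) = bootstrap_mean_gap(y, x)
print(f"gap = {observed:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the conditional mean of $Y$ on that part of the $X$ distribution differs from the unconditional mean at roughly the 5% level; the same function could be repeated for other parts of the distribution of $X$.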


Marcin Wolski, PhD
Climate Economist
European Investment Bank
E-mail: M.Wolski (at) eib.org
Phone: +352 43 79 88708

View my profile
View my IDEAS/RePEc profile
