September 2020
 

(This post was also published on medium.com)

The COVID-19 pandemic has profoundly shaken the world. No other event in recent history has led to such a revolution in our lifestyle, or forced countries to halt their economic activity by implementing nationwide lockdowns. As the crisis was unfolding rapidly, standard economic indicators were initially of little use. Alternative sources could offer a helping hand, but they often require digging through piles of non-standardized data. This can be a huge problem if you want to quickly come up with a view on the topic. In this blog post we are going to show you how to get some of the alternative-data indicators into just the right shape to support your COVID-19 project.

An example: COVID-19 and electricity

In general, GDP and economic aggregates are easi...

February 2020
 

I already emphasized that Big Data changes the way data analysts think and work. Even hedge funds and stock traders adjust their positions in reaction to information delivered through alternative data channels. For instance, flight information showing that a jet belonging to Occidental Petroleum had landed at an airport in Omaha helped traders anticipate that the company was about to receive an extra capital injection of USD 10bn from Warren Buffett (read FT article). While many alternative data sets are premium, a fairly large number are available to the public. For quite some time I have been curious to explore them in detail and, as it turned out, a small proof-of-concept project offered an opportunity to do so.

The goal was to check if the satell...

October 2019
 

Stata has several built-in limits in its engine. They mostly support efficient memory allocation and, overall, make commands run faster. For the majority of applications the limits are large enough that the user will not even notice them (the detailed list of limits can be found here). It is, however, the one-in-a-million applications that can make Stata routines quite cumbersome.

The limit which I discovered recently concerns the maximum macro length (or the maximum command length, it is difficult to judge). Even in Stata/MP, a macro can hold at most 4,227,143 characters in Stata 14. In Stata 15 the limit is nearly four times larger but, as upgrading did not fix the problem, I suspect that it is related to the command length rather than the macro length.

I had to use an SQL query to select the records with certain identifiers. The average size of the identifier was...

September 2019
 

The topic of big data has recently been in the spotlight, coming to the attention of not only data scientists and researchers but even journalists and Internet enthusiasts. As a buzzword it has spread so widely that nearly everybody I speak with asks if my methodological papers can be applied to big data. And honestly, after browsing through countless resources and talking with multiple experts, I am finally able to conclude: yes. But coming to this conclusion required putting some big data concepts together in a way which I could not find written explicitly elsewhere. I therefore set out my personal point of view below.

For me, the topic of big data was difficult to comprehend due to its terminology, which comes from multiple disciplines. Starting with the term ‘big data’, what distinguishes the fields is not the word ‘data’, but rather the word ‘big’. Of course, different sciences work with di...

April 2019
 

There are many packages dealing with density estimation in R. They offer several advantages over manually coded algorithms, including bandwidth-selection procedures and more complex features of density estimation, like derivative estimation or higher-order kernels. Some of them are also coded in native C, which should speed up the calculations and improve memory management. Nevertheless, many of these extra features may often go unused in simple applications of density estimation. That leaves the question open: which algorithm is the fastest?
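To make "manually coded algorithm" concrete, here is a minimal sketch of a kernel density estimate evaluated at a single point. It is written in Python rather than R purely for illustration, and it assumes a Gaussian kernel and a user-supplied bandwidth; the function names are hypothetical and are not taken from any of the packages discussed.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_at_point(sample, x, bandwidth):
    """Estimate the density of `sample` at the single point `x`.

    Computes (1 / (n * h)) * sum_i K((x - x_i) / h) with a Gaussian
    kernel K and a fixed bandwidth h chosen by the caller
    (no automatic bandwidth selection is attempted).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
    return total / (n * bandwidth)


if __name__ == "__main__":
    data = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]  # made-up illustrative sample
    print(kde_at_point(data, x=0.0, bandwidth=0.5))
```

A bare-bones estimator like this has none of the extra features of the packages, which is what makes it a natural baseline in a speed comparison.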

I try to look at the broadest possible set of R packages dealing with density estimation, including np, kdensity, sm and GenKern. There are some other packages which I skip here, as I wanted to make sure I estimate the density at a given point, and not across the domain, to make the numbers comparable. (As a s...

March 2019
 

Data access is often a nightmare, especially with irregular data shapes or multiple data types. APIs, or application programming interfaces, offer simple access gates to information resources in their native structures, and are therefore a powerful tool to quickly boost many research projects.

In a nutshell, an API is a gate through which a user may access the resources or data located on a server in a quick and friendly way. APIs have a generic address, typically in the form of an HTTP address, and endpoints. Endpoints direct the user to the specific parts of the database (like tables) that the user may need to access. Many APIs require an authentication key, called a token, which gives the server an access control mechanism. Sometimes you need to pay for a token, but oftentimes some limited functionality is offered for free.
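The three ingredients described above (base address, endpoint, token) can be sketched as follows. This is an illustration in Python, not the API used in this post: the base address `https://api.example.com/v1`, the `trades` endpoint, the query parameters and the token value are all hypothetical, and sending the token as a `Bearer` header is an assumption, since some APIs expect it as a query parameter instead. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base address; a real API documents its own.
BASE_URL = "https://api.example.com/v1"


def build_request(endpoint, token, **params):
    """Combine base address, endpoint and query parameters into a request.

    The token is attached as an Authorization header (an assumption;
    check the documentation of the API you are using).
    """
    url = f"{BASE_URL}/{endpoint.strip('/')}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    # 'trades', 'symbol' and 'limit' are made-up names for illustration.
    req = build_request("trades", "MY_TOKEN", symbol="ABC", limit=10)
    print(req.full_url)
    # Passing `req` to urllib.request.urlopen would perform the actual call.
```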

To demonstrate the performance of the API, I will access the trading database of

R programming
February 2019
 

I recently came across the problem of testing if the expectations of one variable, call it $Y$, vary along the distribution of another variable, say $X$. The problem can be approached from several angles, including a parametric quantile approach; however, I decided to use one of the most flexible methods, and actually one of my favorites, i.e. the bootstrap.

The idea is quite simple. Imagine two random variables $Y$ and $X$. (For the exact definition of a random variable, the Wikipedia page has a lot of useful information.) Given their observed realisations $\{(Y_i,X_i):i=1,\ldots,n\}$, the goal is to test if the conditional average of $Y$ is statistically different from its unconditional average. We can approximate the former by estimating the mean of $Y$ for different parts of the distribution of $X$. For instance, we can test if expectations of $Y$ d...
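One way to sketch this idea, in Python rather than R and under assumptions that are mine rather than the post's: take the mean of $Y$ over the lowest share of the $X$ distribution, subtract the overall mean of $Y$, and resample the pairs $(Y_i, X_i)$ with replacement to obtain a percentile confidence interval for that gap. The function names, the pairs bootstrap and the percentile interval are illustrative choices; if zero falls outside the interval, the conditional and unconditional averages differ at the chosen level.

```python
import random
import statistics


def mean_gap(pairs, q=0.5):
    """Mean of Y over the lowest q-share of X minus the overall mean of Y.

    `pairs` is a list of (y, x) tuples.
    """
    ordered = sorted(pairs, key=lambda p: p[1])  # sort by X
    k = max(1, int(len(ordered) * q))
    low_y = [y for y, _ in ordered[:k]]
    all_y = [y for y, _ in pairs]
    return statistics.fmean(low_y) - statistics.fmean(all_y)


def bootstrap_ci(pairs, q=0.5, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean_gap, resampling (y, x) pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    gaps = sorted(
        mean_gap([pairs[rng.randrange(n)] for _ in range(n)], q)
        for _ in range(n_boot)
    )
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Simulated data in which Y rises with X (illustrative only).
    rng = random.Random(1)
    data = []
    for _ in range(300):
        x = rng.gauss(0, 1)
        data.append((x + rng.gauss(0, 1), x))
    print(bootstrap_ci(data))
```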


Marcin Wolski, PhD
Advisor to Vice-President
European Investment Bank
E-mail: M.Wolski (at) eib.org
Phone: +352 43 79 88708

View my LinkedIn profile
View my IDEAS/RePEc profile