October 2019
 
 
by @mw

Dealing with long queries in Stata

Stata has several built-in limits in the engine. They mostly support the efficient memory allocation and overall make the commands run faster. For the majority of applications the limits are large enough so that the user will not even notice them (the detailed list of limits can be found here). This is, however, these one-in-a-million applications which may make Stata routines quite cumbersome.

The limit which I discovered recently was about the maximum macro length (or the maximum command length, difficult to judge). Even in Stata MP, the maximum number of characters in a macro can be up to 4,227,143 in Stata 14. In Stata 15 it is nearly 4 times more but as it did not fix the problem, I suspect that it is related to the command length rather than macro length.

I had to use the SQL query to select the records with certain identifiers. The average size of the identifier was about 13 characters. This plus additional two characters for bracketing and one character for coma separation, allowed me to pull some 264k unique identifiers at once. The problem was that I had to deal with a much larger set.

The following idea turned out to be a useful workaround of the Stata limits. The identifiers were in the data set under the variable name ID. The goal is to split the original datasets in slices, run the macro on each slice separately, execute the code, and combine the results. The code snippet which can be easily rearranged to meet different specifications and variable names is below.

/* slice the dataset in blocks of 500 */
egen slice = seq(), block(500) 

/* iterate the code for each slice */
glevelsof slice, local(lof_slice)
foreach i of local lof_slice {

	*preserve the original data set in memory
	preserve
	
	*keep only one slice at a time
	keep if slice == `i'
	di "Working on slice `i'"
	
	*build sql query
	qui glevelsof ID, local(lof_ID)
	local sql_in
	foreach id of local lof_ID {
		local sql_in `sql_in' '`id''
	}
	local sql_in: subinstr local sql_in "' '" "','", all
	
	*empty the data set
	clear
	
	*load the data from SQL query (replace [TABLE_NAME] and [IDENTIFIER] with relevant values
	qui odbc load, exec("SELECT * FROM [TABLE_NAME] WHERE [IDENTIFIER] IN ( `sql_in' );") noquote clear	
	
	*save/append the results (set PATH directory)
	gen slice = `i' 
	if(`i' > 1) {
		append using "PATH\\results.dta"
	}
	save "PATH\\results.dta", replace
          
	*restore the original data set
	restore

}	

Two caveats. In the code I rely on the command glevelsof, from the package gtools. It is a much faster alternative to the original levelsof command. Secondly, for the identifiers with more characters you may want to decrease the number of items per slice. In my application 500 offered a good balance between functionality and reliability, but sometimes it is possible to work with larger slices.

Leave your comment



M. Wolski
Marcin Wolski, PhD
Economist
European Investment Bank
E-mail: M.Wolski (at) eib.org
Phone: +352 43 79 88708

View my LinkedIn profile View my profile
View my IDEAS/RePEc profile  IDEAS/RePEc