Awesome Stata 17 Upgrade: Trying Pystata by using PPML on USITC and WITS database
Finally got my hand on Stata 17. When I got my email about Stata 17, I was very excited on the new PyStata feature. It allows for calling python from Stata and calling Stata from Python. There are many other cool upgrades on Stata 17 especially for Macroeconometrician, but the PyStata is especially interesting for me!
It doesn’t take long for me to try using PyStata on my Jupyter Lab. While using Python on Stata is certainly super interesting, I of course like better using Stata on Jupyter. I write things from Jupyter (like this post), so running things straight from Jupyter would be super awesome. While there are many things we can do with Python, it’s still hard for me to say goodbye to Stata. For one, I use packages like Levinsohn-Petrin algorithm for my research. Moreover, Stata still has plenty of familiarity for my own compared with Pandas. Lastly, most of my colleagues don’t even speak python or r, so Stata remains to be my main tool for collaboration.
So this post is dedicated to myself being super excited to Trying PyStata for the first time! As a trade economist, what’s best but to start with some trade data? In this post, I take data from the internet with Pandas, then transfer it to Stata database (you know, the .dta), and then run a PPML a la Silva and Tenreyro. Yes there is a PPML package for Python called GME (I even blogged about it) but come on I am having fun with PyStata here!
There is of course a better starting place to read if you really want to go on with playing with your PyStata, but I hope this post is fun and useful enuff four you.
Since this post is all about using Stata command in Python environment (and not the other way around), I mainly use Stata Magic. Maybe if I manage to use Mata Magic and PyStata Magic I will blog about it as well but the Stata Magic is what I am super excited about for now.
Okay lets go on to the show!
Installing PyStata
obviously the first thing you should do is to have Stata 17 installed on your computer. Remember where it is installed (it will be used later) but if you just click next
like I do then it’s gonna be in C:/Program Files/Stata17/
. When we’re done installing Stata 17, then all we need to do is to install PyStata in our env using
pip install --upgrade --user stata_setup
Then we’re good to go!
Collecting the data (in Pandas)
I am using publicly available data for this post so everyone can try this. No, not car, don’t worry.
Since this will be about gravity regression, I will need gravity data. Let’s get it from USITC Dynamic Gravity Dataset for our trial here. Be mindful of the size, this dataset is suuuuuuuuuper huge!
weqs=pd.read_csv('https://www.usitc.gov/data/gravity/dgd_docs/release_2.1_2000_2016.zip')
weqs.columns # checking its variables
Index(['year', 'country_o', 'iso3_o', 'dynamic_code_o', 'country_d', 'iso3_d',
'dynamic_code_d', 'colony_of_destination_ever', 'colony_of_origin_ever',
'colony_ever', 'common_colonizer', 'common_legal_origin', 'contiguity',
'distance', 'member_gatt_o', 'member_wto_o', 'member_eu_o',
'member_gatt_d', 'member_wto_d', 'member_eu_d', 'member_gatt_joint',
'member_wto_joint', 'member_eu_joint', 'lat_o', 'lng_o', 'lat_d',
'lng_d', 'landlocked_o', 'island_o', 'region_o', 'landlocked_d',
'island_d', 'region_d', 'agree_pta_goods', 'agree_pta_services',
'agree_fta', 'agree_eia', 'agree_cu', 'agree_psa', 'agree_fta_eia',
'agree_cu_eia', 'agree_pta', 'capital_const_d', 'capital_const_o',
'capital_cur_d', 'capital_cur_o', 'gdp_pwt_const_d', 'gdp_pwt_const_o',
'gdp_pwt_cur_d', 'gdp_pwt_cur_o', 'pop_d', 'pop_o', 'hostility_level_o',
'hostility_level_d', 'common_language', 'polity_o', 'polity_d',
'sanction_threat', 'sanction_threat_trade', 'sanction_imposition',
'sanction_imposition_trade', 'gdp_wdi_cur_o', 'gdp_wdi_cap_cur_o',
'gdp_wdi_const_o', 'gdp_wdi_cap_const_o', 'gdp_wdi_cur_d',
'gdp_wdi_cap_cur_d', 'gdp_wdi_const_d', 'gdp_wdi_cap_const_d'],
dtype='object')
Since this is a test, I will use just some of the variable. First, I will only use Indonesia as a country of origin. Secondly, I will only use cross-section in the year 2016. Next, only several variable will be used, namely, export partner’s population, GDP and distance, as well as FTA with Indonesia. Obviously there are more stuff needs to be added if you want to do a real gravity analysis but this is just a test come on.
grav=weqs.query('country_o == "Indonesia"') # take only Indonesia as a country of origin
grav=grav.query('year==2016') # take only year 2016
grav=grav[['country_d','pop_d', 'agree_fta', 'gdp_wdi_cur_d', 'distance']] # take some variables only
grav=grav.rename(columns={'country_d':'Partner'})
grav.head() # check content
Partner | pop_d | agree_fta | gdp_wdi_cur_d | distance | |
---|---|---|---|---|---|
455937 | Aruba | 0.104822 | 0.0 | 2.646927e+09 | 18924.247763 |
455954 | Afghanistan | NaN | 0.0 | 1.936264e+10 | 6158.679225 |
455971 | Angola | 28.813463 | 0.0 | 1.011239e+11 | 10483.002562 |
455988 | Anguilla | 0.014764 | 0.0 | NaN | 18167.048245 |
456005 | Aland Islands | NaN | 0.0 | NaN | 10410.128664 |
Now that i have the gravity variables, let’s get the trade data. I will need Indonesia’s export value to partners as well as their import tariff. Again, we normally need to do more rigorous check on the export and tariff, but since this is just a test, I will use total export as well as MFN weighted average tariff.
To get from WITS i use world_trade_data package in python.
import world_trade_data as wits
# Grabbing Indonesia's export value to partners in 2016
exports = wits.get_indicator('XPRT-TRD-VL', reporter='idn',partner='all', year='2016',product='Total')
exports=exports.reset_index()
exports=exports[['Partner','Value']] # Take only necessary columns
exports=exports.rename(columns={'Value':'Export_Value'}) # rename to merge later
# Grabbing tariff from Indonesia's partners
tariff=wits.get_indicator('MFN-WGHTD-AVRG',reporter='all',year='2016',datasource='tradestats-tariff',product='Total')
tariff=tariff.reset_index()
tariff=tariff[['Reporter','Value']] # take only necessary columns
tariff=tariff.rename(columns={"Reporter": "Partner","Value":"Tariff"}) # rename to merge later
trade=pd.merge(exports,tariff,how='left',on='Partner') # merge the two dataset
now that I have my trade,tariff and gravity data, lets merge them to get the final data to feed to our Stata PPML.
idntrade=pd.merge(grav,trade,how='left',on='Partner')
idntrade['Export']=idntrade['Export_Value']/1000000
idntrade.head()
Partner | pop_d | agree_fta | gdp_wdi_cur_d | distance | Export_Value | Tariff | Export | |
---|---|---|---|---|---|---|---|---|
0 | Aruba | 0.104822 | 0.0 | 2.646927e+09 | 18924.247763 | 878.469 | 10.158305 | 0.000878 |
1 | Afghanistan | NaN | 0.0 | 1.936264e+10 | 6158.679225 | 16220.103 | NaN | 0.016220 |
2 | Angola | 28.813463 | 0.0 | 1.011239e+11 | 10483.002562 | 57713.955 | 9.379807 | 0.057714 |
3 | Anguilla | 0.014764 | 0.0 | NaN | 18167.048245 | NaN | NaN | NaN |
4 | Aland Islands | NaN | 0.0 | NaN | 10410.128664 | NaN | NaN | NaN |
Well, obviously we can merge and cook our data ini Stata as well but as you can see it’s kinda faster to do it in Pandas unless I need to generate anything new.
Run a PPML in Stata
Now that the data’s ready, we can begin running PPML in Stata! Now, we need to call Stata to our Python environment with the import
command. This is where we need our Stata location.
import stata_setup
stata_setup.config("C:/Program Files/Stata17/", "mp")
___ ____ ____ ____ ____ ©
/__ / ____/ / ____/ 17.0
___/ / /___/ / /___/ MP—Parallel Edition
Statistics and Data Science Copyright 1985-2021 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-STATA-PC https://www.stata.com
979-696-4600 stata@stata.com
Look at how georgeous Stata logo is.
Next, we load the dataset to Stata! I use Stata magic to do this. there is also a possibility calling Stata with API like this one and this one, but in Jupyter, using magic is somewhat easier for me.
Now lets call the data and describe it.
%%stata -d idntrade -force
describe
Contains data
Observations: 255
Variables: 8
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
Partner str45 %45s
pop_d double %10.0g
agree_fta double %10.0g
gdp_wdi_cur_d double %10.0g
distance double %10.0g
Export_Value double %10.0g
Tariff double %10.0g
Export double %10.0g
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Wooohoo cool! Let’s try summarize
%%stata
su
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
Partner | 0
pop_d | 183 42.87855 150.9889 .005152 1403.5
agree_fta | 255 .0588235 .2357568 0 1
gdp_wdi_cu~d | 206 5.25e+11 2.26e+12 3.65e+07 1.87e+13
distance | 253 10594.37 4663.103 951.1062 19190.28
-------------+---------------------------------------------------------
Export_Value | 192 676920.4 2420380 .101 1.68e+07
Tariff | 119 6.739741 4.004538 0 18.73486
Export | 192 .6769204 2.42038 1.01e-07 16.78559
Awesome!! Now let’s do our ppml command.
%%stata
ppml Export Tariff pop_d agree_fta gdp_wdi_cur_d distance
note: checking the existence of the estimates
WARNING: Tariff has very large values, consider rescaling or recentering
WARNING: pop_d has very large values, consider rescaling or recentering
WARNING: gdp_wdi_cur_d has very large values, consider rescaling or recenterin
> g
WARNING: distance has very large values, consider rescaling or recentering
note: starting ppml estimation
note: Export has noninteger values
Iteration 1: deviance = 118.6375
Iteration 2: deviance = 101.5413
Iteration 3: deviance = 101.0093
Iteration 4: deviance = 101.0082
Iteration 5: deviance = 101.0082
Number of parameters: 6
Number of observations: 114
Number of observations dropped: 0
Pseudo log-likelihood: -102.58858
R-squared: .75786551
------------------------------------------------------------------------------
| Robust
Export | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Tariff | -.0906943 .0542772 -1.67 0.095 -.1970757 .015687
pop_d | -.0002608 .0004712 -0.55 0.580 -.0011843 .0006626
agree_fta | 2.28588 .5362387 4.26 0.000 1.234871 3.336888
gdp_wdi_cu~d | 2.04e-13 3.58e-14 5.69 0.000 1.33e-13 2.74e-13
distance | -.0000199 .000065 -0.31 0.760 -.0001473 .0001075
_cons | -.488918 .7264721 -0.67 0.501 -1.912777 .9349412
------------------------------------------------------------------------------
Number of regressors dropped to ensure that the estimates exist: 0
Option strict is off
woohoo! The result is also great considering how under-specified the model is! High $R^2$ , Tariff is negative and strong, FTA and GDP are both highly important in determining foreign demand!
Now one thing I can’t quite work out is to extract the information from the regression back to pandas. I usually just go with outreg and do it manually, but it’s no fun if I can’t integrate the regression result in my notebook.
What I can do is to export the result to csv and read it thru Pandas.
%%stata
eststo
esttab using hehe.csv, plain replace wide
. eststo
(est1 stored)
. esttab using hehe.csv, plain replace wide
(output written to hehe.csv)
.
dff=pd.read_csv('hehe.csv')
dff
Unnamed: 0 | est1 | Unnamed: 2 | |
---|---|---|---|
0 | NaN | b | t |
1 | Tariff | -.0906943 | -1.670947 |
2 | pop_d | -.0002608 | -.5535483 |
3 | agree_fta | 2.28588 | 4.262802 |
4 | gdp_wdi_cur_d | 2.04e-13 | 5.688106 |
5 | distance | -.0000199 | -.3056529 |
6 | _cons | -.488918 | -.6730031 |
7 | N | 114 | NaN |
The result is not the best so this is kinda bummer. But at least I have something. haha. There may be a better way to do this but let me look for it and post it later on. If you have a suggestion please mention me on twitter!
Anyway, that’s it for now. I hope this post is enjoyable!