First of all, you need to install the Anaconda Prompt (conda). Tou have to follow the instructions on the conda website: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html.
Then, you have to install the jupyter notebook. The instructions are available on their website: https://jupyter.org/install.
In a third step, you have to install PyStata: https://www.stata.com/python/pystata/install.html.
In Stata, enter the following command to locate the folder containing Stata 18 : display c(sysdir_stata)
The software is in this folder 'C:\Program Files\Stata18/' in my case and I use the standard edition.
In conda, install PyStata : pip install --upgrade --user stata_setup
Lastly, you have to install the matplotlib library for an illustration with a 3D graph: https://matplotlib.org/stable/users/installing.html
In conda, you can type:
python -m pip install -U pip
python -m pip install -U matplotlib
Now, you are ready to launch Stata from the Jupyter Notebook.
import stata_setup
stata_setup.config('C:\Program Files\Stata18/', 'se')
___ ____ ____ ____ ____ ®
/__ / ____/ / ____/ 18.0
___/ / /___/ / /___/ SE—Standard Edition
Statistics and Data Science Copyright 1985-2023 StataCorp LLC
StataCorp
4905 Lakeway Drive
College Station, Texas 77845 USA
800-STATA-PC https://www.stata.com
979-696-4600 stata@stata.com
Stata license: Single-user perpetual
Serial number: 401806202105
Licensed to: Jamel Saadaoui
University of Strasbourg
Notes:
1. Unicode is supported; see help unicode_advice.
2. Maximum number of variables is set to 5,000 but can be increased;
see help set_maxvar.
%%stata
sysuse auto, clear
summarize mpg
summarize
.
. sysuse auto, clear
(1978 automobile data)
.
. summarize mpg
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
.
. summarize
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
make | 0
price | 74 6165.257 2949.496 3291 15906
mpg | 74 21.2973 5.785503 12 41
rep78 | 69 3.405797 .9899323 1 5
headroom | 74 2.993243 .8459948 1.5 5
-------------+---------------------------------------------------------
trunk | 74 13.75676 4.277404 5 23
weight | 74 3019.459 777.1936 1760 4840
length | 74 187.9324 22.26634 142 233
turn | 74 39.64865 4.399354 31 51
displacement | 74 197.2973 91.83722 79 425
-------------+---------------------------------------------------------
gear_ratio | 74 3.014865 .4562871 2.19 3.89
foreign | 74 .2972973 .4601885 0 1
.
%%stata
*twoway (histogram mpg, color(red%30))
. . *twoway (histogram mpg, color(red%30)) .
import pandas as pd
import io
import requests
data = requests.get("https://www.jamelsaadaoui.com/wp-content/uploads/2023/05/nhanes2.csv").content
nhanes2 = pd.read_csv(io.StringIO(data.decode("utf-8")))
%%stata -d nhanes2
sum
.
. sum
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
sampl | 10,351 33625.57 18412.28 1400 64709
strata | 10,351 16.6667 9.497287 1 32
psu | 0
region | 0
smsa | 0
-------------+---------------------------------------------------------
location | 10,351 33.0655 18.41254 1 64
houssiz | 10,351 2.943774 1.695156 1 14
sex | 0
race | 0
age | 10,351 47.57965 17.21483 20 74
-------------+---------------------------------------------------------
height | 10,351 167.6509 9.655916 135.5 200
weight | 10,351 71.89752 15.35642 30.84 175.88
bpsystol | 10,351 130.8817 23.33265 65 300
bpdiast | 10,351 81.715 12.92722 35 150
tcresult | 10,351 217.6697 49.38694 80 828
-------------+---------------------------------------------------------
tgresult | 5,050 143.8958 96.49629 16 2238
hdresult | 8,720 49.64266 14.31176 15 187
hgb | 10,351 14.26044 1.384677 6.9 20.2
hct | 10,351 41.98648 3.67368 20.2 60.7
tibc | 10,351 366.986 55.64079 157 792
-------------+---------------------------------------------------------
iron | 10,351 99.44595 34.08279 16 321
hlthstat | 0
heartatk | 0
diabetes | 0
sizplace | 0
-------------+---------------------------------------------------------
finalwgt | 10,351 11318.47 7304.04 2000 79634
leadwt | 10,351 11283.84 15011.21 0 81601
corpuscl | 10,262 89.96747 5.52525 58.3 125.9
trnsfern | 10,351 27.60282 10.04152 3.1 94.3
albumin | 10,016 4.669159 .331095 3 5.8
-------------+---------------------------------------------------------
vitaminc | 9,973 1.034814 .5813791 .1 18.1
zinc | 9,202 86.50739 14.47822 43 240
copper | 9,131 125.6094 32.52205 37 346
porphyrn | 10,270 53.67429 25.71968 20 1307
lead | 4,948 14.32033 6.166468 2 80
-------------+---------------------------------------------------------
female | 0
black | 0
orace | 0
fhtatk | 0
hsizgp | 10,351 2.790938 1.332086 1 5
-------------+---------------------------------------------------------
hsiz1 | 10,351 .167037 .3730269 0 1
hsiz2 | 10,351 .3506908 .4772093 0 1
hsiz3 | 10,351 .1682929 .3741443 0 1
hsiz4 | 10,351 .1522558 .359286 0 1
hsiz5 | 10,351 .1617235 .3682148 0 1
-------------+---------------------------------------------------------
region1 | 10,351 .2024925 .4018767 0 1
region2 | 10,351 .2679934 .4429356 0 1
region3 | 10,351 .2756255 .4468505 0 1
region4 | 10,351 .2538885 .4352556 0 1
smsa1 | 10,351 .2542749 .4354739 0 1
-------------+---------------------------------------------------------
smsa2 | 10,351 .2905999 .4540612 0 1
smsa3 | 10,351 .4551251 .4980062 0 1
rural | 0
loglead | 4,948 2.577758 .4115249 .6931472 4.382027
agegrp | 0
-------------+---------------------------------------------------------
highlead | 0
bmi | 10,351 25.5376 4.914969 12.3856 61.1297
highbp | 10,351 .4227611 .494022 0 1
.
%%stata
logistic highbp c.age##c.weight
*ereturn list
.
. logistic highbp c.age##c.weight
Logistic regression Number of obs = 10,351
LR chi2(3) = 2381.23
Prob > chi2 = 0.0000
Log likelihood = -5860.1512 Pseudo R2 = 0.1689
------------------------------------------------------------------------------
highbp | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.108531 .0080697 14.15 0.000 1.092827 1.12446
weight | 1.081505 .005516 15.36 0.000 1.070748 1.092371
|
c.age#|
c.weight | .9992788 .0000977 -7.38 0.000 .9990873 .9994703
|
_cons | .0002025 .0000787 -21.89 0.000 .0000946 .0004335
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
.
. *ereturn list
.
%%stata -eret myeret
logistic highbp c.age##c.weight
Logistic regression Number of obs = 10,351
LR chi2(3) = 2381.23
Prob > chi2 = 0.0000
Log likelihood = -5860.1512 Pseudo R2 = 0.1689
------------------------------------------------------------------------------
highbp | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.108531 .0080697 14.15 0.000 1.092827 1.12446
weight | 1.081505 .005516 15.36 0.000 1.070748 1.092371
|
c.age#|
c.weight | .9992788 .0000977 -7.38 0.000 .9990873 .9994703
|
_cons | .0002025 .0000787 -21.89 0.000 .0000946 .0004335
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
myeret
{'e(rank)': 4.0,
'e(N)': 10351.0,
'e(ic)': 4.0,
'e(k)': 4.0,
'e(k_eq)': 1.0,
'e(noconstant)': 0.0,
'e(consonly)': 0.0,
'e(k_dv)': 1.0,
'e(converged)': 1.0,
'e(rc)': 0.0,
'e(k_autoCns)': 0.0,
'e(ll)': -5860.151215434775,
'e(k_eq_model)': 1.0,
'e(ll_0)': -7050.765484416371,
'e(df_m)': 3.0,
'e(chi2)': 2381.228537963192,
'e(p)': 0.0,
'e(N_cdf)': 0.0,
'e(N_cds)': 0.0,
'e(r2_p)': 0.1688631215451849,
'e(cmdline)': 'logistic highbp c.age##c.weight',
'e(cmd)': 'logistic',
'e(predict)': 'logistic_p',
'e(estat_cmd)': 'logit_estat',
'e(marginsderiv)': 'default Pr',
'e(marginsok)': 'default Pr',
'e(marginsnotok)': 'stdp DBeta DEviance DX2 DDeviance Hat Number Residuals RStandard SCore',
'e(title)': 'Logistic regression',
'e(chi2type)': 'LR',
'e(deriv_useminbound)': 'off',
'e(opt)': 'moptimize',
'e(vce)': 'oim',
'e(user)': 'mopt__logit_d2()',
'e(crittype)': 'log likelihood',
'e(ml_method)': 'd2',
'e(singularHmethod)': 'm-marquardt',
'e(technique)': 'nr',
'e(which)': 'max',
'e(depvar)': 'highbp',
'e(properties)': 'b V',
'e(b)': array([[ 1.03035512e-01, 7.83537336e-02, -7.21492364e-04,
-8.50485074e+00]]),
'e(V)': array([[ 5.29930775e-05, 3.50509320e-05, -6.97861007e-07,
-2.69423165e-03],
[ 3.50509320e-05, 2.60132665e-05, -4.74084055e-07,
-1.94299577e-03],
[-6.97861007e-07, -4.74084055e-07, 9.55811843e-09,
3.50377702e-05],
[-2.69423165e-03, -1.94299577e-03, 3.50377702e-05,
1.50887843e-01]]),
'e(mns)': array([[4.75796541e+01, 7.18975210e+01, 3.43111878e+03, 1.00000000e+00]]),
'e(rules)': array([[0., 0., 0., 0.]]),
'e(ilog)': array([[-7050.76548442, -5886.45946281, -5860.19509926, -5860.15121566,
-5860.15121543, 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ]]),
'e(gradient)': array([[5.49611647e-06, 8.04060803e-06, 5.24302277e-04, 8.32616448e-08]])}
e_v = myeret['e(V)']
e_v
array([[ 5.29930775e-05, 3.50509320e-05, -6.97861007e-07,
-2.69423165e-03],
[ 3.50509320e-05, 2.60132665e-05, -4.74084055e-07,
-1.94299577e-03],
[-6.97861007e-07, -4.74084055e-07, 9.55811843e-09,
3.50377702e-05],
[-2.69423165e-03, -1.94299577e-03, 3.50377702e-05,
1.50887843e-01]])
%%stata
quietly margins, at(age=(20(10)80))
*marginsplot
. . quietly margins, at(age=(20(10)80)) . . *marginsplot .
%%stata -doutd preddata
quietly margins, at(age=(20(5)80) weight=(40(5)180)) saving(predictions, replace)
use predictions, clear
list _at1 _at2 _margin in 1/5
rename _at1 age
rename _at2 weight
rename _margin pr_highbp
.
. quietly margins, at(age=(20(5)80) weight=(40(5)180)) saving(predictions, repl
> ace)
.
. use predictions, clear
(Created by command margins; also see char list)
.
. list _at1 _at2 _margin in 1/5
+------------------------+
| _at1 _at2 _margin |
|------------------------|
1. | 20 40 .0200911 |
2. | 20 45 .0274497 |
3. | 20 50 .0374008 |
4. | 20 55 .0507709 |
5. | 20 60 .0685801 |
+------------------------+
.
. rename _at1 age
.
. rename _at2 weight
.
. rename _margin pr_highbp
.
import matplotlib.pyplot as plt
import numpy as np
# define the axes
fig = plt.figure(1, figsize=(10, 8))
ax = plt.axes(projection='3d')
# plot
ax.plot_trisurf(preddata['age'], preddata['weight'], preddata['pr_highbp'],cmap=plt.cm.Spectral_r)
# set ticks and labels for x, y, and z axes
ax.set_xticks(np.arange(20, 90, step=10))
ax.set_yticks(np.arange(40, 200, step=40))
ax.set_zticks(np.arange( 0, 1.2, step=0.2))
ax.set_xlabel("Age (years)")
ax.set_ylabel("Weight (kg)")
ax.set_zlabel("Probability of Hypertension")
# adjust the view angle
ax.view_init(elev=30, azim=240)
# show the plot
plt.show()