First of all, you need to install the Anaconda Prompt (conda). Tou have to follow the instructions on the conda website: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html.
Then, you have to install the jupyter notebook. The instructions are available on their website: https://jupyter.org/install.
In a third step, you have to install PyStata: https://www.stata.com/python/pystata/install.html.
In Stata, enter the following command to locate the folder containing Stata 18 : display c(sysdir_stata)
The software is in this folder 'C:\Program Files\Stata18/' in my case and I use the standard edition.
In conda, install PyStata : pip install --upgrade --user stata_setup
Lastly, you have to install the matplotlib library for an illustration with a 3D graph: https://matplotlib.org/stable/users/installing.html
In conda, you can type:
python -m pip install -U pip
python -m pip install -U matplotlib
Now, you are ready to launch Stata from the Jupyter Notebook.
import stata_setup
stata_setup.config('C:\Program Files\Stata18/', 'se')
___ ____ ____ ____ ____ ® /__ / ____/ / ____/ 18.0 ___/ / /___/ / /___/ SE—Standard Edition Statistics and Data Science Copyright 1985-2023 StataCorp LLC StataCorp 4905 Lakeway Drive College Station, Texas 77845 USA 800-STATA-PC https://www.stata.com 979-696-4600 stata@stata.com Stata license: Single-user perpetual Serial number: 401806202105 Licensed to: Jamel Saadaoui University of Strasbourg Notes: 1. Unicode is supported; see help unicode_advice. 2. Maximum number of variables is set to 5,000 but can be increased; see help set_maxvar.
%%stata
sysuse auto, clear
summarize mpg
summarize
. . sysuse auto, clear (1978 automobile data) . . summarize mpg Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- mpg | 74 21.2973 5.785503 12 41 . . summarize Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- make | 0 price | 74 6165.257 2949.496 3291 15906 mpg | 74 21.2973 5.785503 12 41 rep78 | 69 3.405797 .9899323 1 5 headroom | 74 2.993243 .8459948 1.5 5 -------------+--------------------------------------------------------- trunk | 74 13.75676 4.277404 5 23 weight | 74 3019.459 777.1936 1760 4840 length | 74 187.9324 22.26634 142 233 turn | 74 39.64865 4.399354 31 51 displacement | 74 197.2973 91.83722 79 425 -------------+--------------------------------------------------------- gear_ratio | 74 3.014865 .4562871 2.19 3.89 foreign | 74 .2972973 .4601885 0 1 .
%%stata
*twoway (histogram mpg, color(red%30))
. . *twoway (histogram mpg, color(red%30)) .
import pandas as pd
import io
import requests
data = requests.get("https://www.jamelsaadaoui.com/wp-content/uploads/2023/05/nhanes2.csv").content
nhanes2 = pd.read_csv(io.StringIO(data.decode("utf-8")))
%%stata -d nhanes2
sum
. . sum Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- sampl | 10,351 33625.57 18412.28 1400 64709 strata | 10,351 16.6667 9.497287 1 32 psu | 0 region | 0 smsa | 0 -------------+--------------------------------------------------------- location | 10,351 33.0655 18.41254 1 64 houssiz | 10,351 2.943774 1.695156 1 14 sex | 0 race | 0 age | 10,351 47.57965 17.21483 20 74 -------------+--------------------------------------------------------- height | 10,351 167.6509 9.655916 135.5 200 weight | 10,351 71.89752 15.35642 30.84 175.88 bpsystol | 10,351 130.8817 23.33265 65 300 bpdiast | 10,351 81.715 12.92722 35 150 tcresult | 10,351 217.6697 49.38694 80 828 -------------+--------------------------------------------------------- tgresult | 5,050 143.8958 96.49629 16 2238 hdresult | 8,720 49.64266 14.31176 15 187 hgb | 10,351 14.26044 1.384677 6.9 20.2 hct | 10,351 41.98648 3.67368 20.2 60.7 tibc | 10,351 366.986 55.64079 157 792 -------------+--------------------------------------------------------- iron | 10,351 99.44595 34.08279 16 321 hlthstat | 0 heartatk | 0 diabetes | 0 sizplace | 0 -------------+--------------------------------------------------------- finalwgt | 10,351 11318.47 7304.04 2000 79634 leadwt | 10,351 11283.84 15011.21 0 81601 corpuscl | 10,262 89.96747 5.52525 58.3 125.9 trnsfern | 10,351 27.60282 10.04152 3.1 94.3 albumin | 10,016 4.669159 .331095 3 5.8 -------------+--------------------------------------------------------- vitaminc | 9,973 1.034814 .5813791 .1 18.1 zinc | 9,202 86.50739 14.47822 43 240 copper | 9,131 125.6094 32.52205 37 346 porphyrn | 10,270 53.67429 25.71968 20 1307 lead | 4,948 14.32033 6.166468 2 80 -------------+--------------------------------------------------------- female | 0 black | 0 orace | 0 fhtatk | 0 hsizgp | 10,351 2.790938 1.332086 1 5 -------------+--------------------------------------------------------- hsiz1 | 10,351 .167037 .3730269 0 1 hsiz2 | 10,351 .3506908 .4772093 0 1 hsiz3 | 10,351 .1682929 .3741443 0 1 hsiz4 | 10,351 .1522558 .359286 0 1 hsiz5 | 10,351 .1617235 .3682148 0 1 -------------+--------------------------------------------------------- region1 | 10,351 .2024925 .4018767 0 1 region2 | 10,351 .2679934 .4429356 0 1 region3 | 10,351 .2756255 .4468505 0 1 region4 | 10,351 .2538885 .4352556 0 1 smsa1 | 10,351 .2542749 .4354739 0 1 -------------+--------------------------------------------------------- smsa2 | 10,351 .2905999 .4540612 0 1 smsa3 | 10,351 .4551251 .4980062 0 1 rural | 0 loglead | 4,948 2.577758 .4115249 .6931472 4.382027 agegrp | 0 -------------+--------------------------------------------------------- highlead | 0 bmi | 10,351 25.5376 4.914969 12.3856 61.1297 highbp | 10,351 .4227611 .494022 0 1 .
%%stata
logistic highbp c.age##c.weight
*ereturn list
. . logistic highbp c.age##c.weight Logistic regression Number of obs = 10,351 LR chi2(3) = 2381.23 Prob > chi2 = 0.0000 Log likelihood = -5860.1512 Pseudo R2 = 0.1689 ------------------------------------------------------------------------------ highbp | Odds ratio Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- age | 1.108531 .0080697 14.15 0.000 1.092827 1.12446 weight | 1.081505 .005516 15.36 0.000 1.070748 1.092371 | c.age#| c.weight | .9992788 .0000977 -7.38 0.000 .9990873 .9994703 | _cons | .0002025 .0000787 -21.89 0.000 .0000946 .0004335 ------------------------------------------------------------------------------ Note: _cons estimates baseline odds. . . *ereturn list .
%%stata -eret myeret
logistic highbp c.age##c.weight
Logistic regression Number of obs = 10,351 LR chi2(3) = 2381.23 Prob > chi2 = 0.0000 Log likelihood = -5860.1512 Pseudo R2 = 0.1689 ------------------------------------------------------------------------------ highbp | Odds ratio Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- age | 1.108531 .0080697 14.15 0.000 1.092827 1.12446 weight | 1.081505 .005516 15.36 0.000 1.070748 1.092371 | c.age#| c.weight | .9992788 .0000977 -7.38 0.000 .9990873 .9994703 | _cons | .0002025 .0000787 -21.89 0.000 .0000946 .0004335 ------------------------------------------------------------------------------ Note: _cons estimates baseline odds.
myeret
{'e(rank)': 4.0, 'e(N)': 10351.0, 'e(ic)': 4.0, 'e(k)': 4.0, 'e(k_eq)': 1.0, 'e(noconstant)': 0.0, 'e(consonly)': 0.0, 'e(k_dv)': 1.0, 'e(converged)': 1.0, 'e(rc)': 0.0, 'e(k_autoCns)': 0.0, 'e(ll)': -5860.151215434775, 'e(k_eq_model)': 1.0, 'e(ll_0)': -7050.765484416371, 'e(df_m)': 3.0, 'e(chi2)': 2381.228537963192, 'e(p)': 0.0, 'e(N_cdf)': 0.0, 'e(N_cds)': 0.0, 'e(r2_p)': 0.1688631215451849, 'e(cmdline)': 'logistic highbp c.age##c.weight', 'e(cmd)': 'logistic', 'e(predict)': 'logistic_p', 'e(estat_cmd)': 'logit_estat', 'e(marginsderiv)': 'default Pr', 'e(marginsok)': 'default Pr', 'e(marginsnotok)': 'stdp DBeta DEviance DX2 DDeviance Hat Number Residuals RStandard SCore', 'e(title)': 'Logistic regression', 'e(chi2type)': 'LR', 'e(deriv_useminbound)': 'off', 'e(opt)': 'moptimize', 'e(vce)': 'oim', 'e(user)': 'mopt__logit_d2()', 'e(crittype)': 'log likelihood', 'e(ml_method)': 'd2', 'e(singularHmethod)': 'm-marquardt', 'e(technique)': 'nr', 'e(which)': 'max', 'e(depvar)': 'highbp', 'e(properties)': 'b V', 'e(b)': array([[ 1.03035512e-01, 7.83537336e-02, -7.21492364e-04, -8.50485074e+00]]), 'e(V)': array([[ 5.29930775e-05, 3.50509320e-05, -6.97861007e-07, -2.69423165e-03], [ 3.50509320e-05, 2.60132665e-05, -4.74084055e-07, -1.94299577e-03], [-6.97861007e-07, -4.74084055e-07, 9.55811843e-09, 3.50377702e-05], [-2.69423165e-03, -1.94299577e-03, 3.50377702e-05, 1.50887843e-01]]), 'e(mns)': array([[4.75796541e+01, 7.18975210e+01, 3.43111878e+03, 1.00000000e+00]]), 'e(rules)': array([[0., 0., 0., 0.]]), 'e(ilog)': array([[-7050.76548442, -5886.45946281, -5860.19509926, -5860.15121566, -5860.15121543, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]]), 'e(gradient)': array([[5.49611647e-06, 8.04060803e-06, 5.24302277e-04, 8.32616448e-08]])}
e_v = myeret['e(V)']
e_v
array([[ 5.29930775e-05, 3.50509320e-05, -6.97861007e-07, -2.69423165e-03], [ 3.50509320e-05, 2.60132665e-05, -4.74084055e-07, -1.94299577e-03], [-6.97861007e-07, -4.74084055e-07, 9.55811843e-09, 3.50377702e-05], [-2.69423165e-03, -1.94299577e-03, 3.50377702e-05, 1.50887843e-01]])
%%stata
quietly margins, at(age=(20(10)80))
*marginsplot
. . quietly margins, at(age=(20(10)80)) . . *marginsplot .
%%stata -doutd preddata
quietly margins, at(age=(20(5)80) weight=(40(5)180)) saving(predictions, replace)
use predictions, clear
list _at1 _at2 _margin in 1/5
rename _at1 age
rename _at2 weight
rename _margin pr_highbp
. . quietly margins, at(age=(20(5)80) weight=(40(5)180)) saving(predictions, repl > ace) . . use predictions, clear (Created by command margins; also see char list) . . list _at1 _at2 _margin in 1/5 +------------------------+ | _at1 _at2 _margin | |------------------------| 1. | 20 40 .0200911 | 2. | 20 45 .0274497 | 3. | 20 50 .0374008 | 4. | 20 55 .0507709 | 5. | 20 60 .0685801 | +------------------------+ . . rename _at1 age . . rename _at2 weight . . rename _margin pr_highbp .
import matplotlib.pyplot as plt
import numpy as np
# define the axes
fig = plt.figure(1, figsize=(10, 8))
ax = plt.axes(projection='3d')
# plot
ax.plot_trisurf(preddata['age'], preddata['weight'], preddata['pr_highbp'],cmap=plt.cm.Spectral_r)
# set ticks and labels for x, y, and z axes
ax.set_xticks(np.arange(20, 90, step=10))
ax.set_yticks(np.arange(40, 200, step=40))
ax.set_zticks(np.arange( 0, 1.2, step=0.2))
ax.set_xlabel("Age (years)")
ax.set_ylabel("Weight (kg)")
ax.set_zlabel("Probability of Hypertension")
# adjust the view angle
ax.view_init(elev=30, azim=240)
# show the plot
plt.show()