---
tags: statistics
---
# Introduction to Data Science (資料科學入門)
Original course title: Learning Statistics & Programming (當統計學與程式相遇)
![](https://i2.wp.com/www.prosancons.com/wp-content/uploads/2018/04/041518_1320_Prosandcons1.png =400x)
"The purpose of computing is insight, not numbers."
-- Richard W. Hamming
"Data do not speak for themselves;
there is always an interpreter, or a translator."
-- John W. Ratcliffe
"Remember that all models are wrong;
the practical question is how wrong do they have to be to not be useful."
-- George Box
"It is easy to lie with statistics, but easier to lie without them."
-- Frederick Mosteller
"Science is more than a body of knowledge;
it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem,
is far more important than the findings of science."
-- Carl Sagan
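### Instructor
- 盧政良 (Zheng-Liang Lu, Arthur)
- Contact: [email protected]
### Working environment
- Google Colab https://colab.research.google.com/
:::warning
This course is not tied to a particular programming language, but demonstrations use [Python](https://hackmd.io/@arthurzllu/HJNXq84SO). Participants may follow along in Excel, R, or MATLAB, but must find the corresponding tools for the exercises themselves.
:::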
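### Prerequisites
- Arithmetic and elementary algebra
- Everyday experience and civic ethics
- (Optional) Calculus
    - Prof. 朱樺 (Department of Mathematics, National Taiwan University), [Calculus](http://www.math.ntu.edu.tw/~hchu/Calculus/)
    - 3Blue1Brown, [Essence of Calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr)
    ![](https://i.imgur.com/1ZDNtsh.png =200x)
- (Optional) Linear algebra
    - Stephen H. Friedberg, Arnold J. Insel, Lawrence E. Spence, [Linear Algebra](https://www.amazon.com/Linear-Algebra-5th-Stephen-Friedberg/dp/0134860241/), 5/e, 2018
    ![](https://i.imgur.com/Si0JvQb.png =100x)
    - 3Blue1Brown, [Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) on YouTube
    ![](https://i.imgur.com/tnuPzSf.png =200x)
:::warning
For now, the mathematics in this course only needs to be understood at the level of context and results; whether to master the derivations and computational details is up to each participant's interest and ability. Most of these results are already implemented in Python packages, so tedious calculations can be left to the computer.
:::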
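### Learning objectives
* Statistics
    - Understand statistical tools and the principles behind their computation
    - Interpret statistical results correctly
    - Make reasonable predictions of trends in data
    - Rule out statistical fallacies
* Programming
    - Master the data-processing workflow
    - Learn to build your own tools
![](https://i.imgur.com/nVXc2TN.png =250x)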
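### Grading
#### In-person course
- Final project presentation
    - Key items: question posed, data collection and visualization, model assumptions, experimental results, conclusions.
    - Grouping: each participant works alone; present using **slides** or a **Jupyter notebook**.
- Participants who complete the five programming assignments or the final project report receive a certificate for this course.
- Email assignments to [email protected], stating the course name and your name.
#### Online course
- Participants who complete the five programming assignments receive a certificate for this course.
- Submit assignments by uploading a **Jupyter notebook** to [NTU COOL](https://cool.ntu.edu.tw/).
### Intended audience
1. University and college students, researchers, and engineers who want to learn **statistical methods** and **quantitative research**.
2. Junior and senior high school students are welcome, preferably with some basic statistics (in the 108 Curriculum: Probability and Statistics I in grade 11 and Probability and Statistics II in grade 12).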
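## Main references
- Steven S. Skiena, [The Data Science Design Manual](https://www.springer.com/gp/book/9783319554433), 2017
![](https://i.imgur.com/JrFu6Hh.png =100x)
- Laura Igual and Santi Seguí, [Introduction to Data Science](https://link.springer.com/book/10.1007/978-3-319-50017-1), 2017
![](https://hackmd.io/_uploads/Sk0GAENaq.png =100x)
- 陳旭昇, [統計學:應用與進階](http://homepage.ntu.edu.tw/~sschen/Book/Book1.html), 3rd edition, 2015
![](https://hackmd.io/_uploads/Sym7Jc2eq.png =100x)
## Course outline
0. Python programming basics
1. Data acquisition and visualization
2. Introduction to probability theory and common probability models
3. Statistical tests
4. Point estimation and interval estimation
5. Law of large numbers and central limit theorem
6. Regression models
7. Time series analysis
8. Bayesian probability
9. Introduction to machine learning
10. (Optional) Statistics in practice
![](https://i.imgur.com/fj7sI9u.png =400x)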
## Course content
### Python crash course
- Python crash course [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_0_python_programming.ipynb)
    - Data types and basic operations
    - Conditional statements
    - Loops
    - Functions
- Supplementary material
    - Additional background related to programming skills [pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/cs_preliminary_knowledge.pdf)
    - Apps for self-study
        - [Learn Python](https://play.google.com/store/apps/details?id=com.sololearn.python&hl=zh_TW&gl=US)
![](https://i.imgur.com/sMkTwsp.png =300x)
### Data acquisition and visualization
- Pandas [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_data_acquisition_visualization.ipynb)
    - Python data analysis library [link](https://pandas.pydata.org/)
    - (FYR) https://www.kaggle.com/learn/pandas
    - (FYR) Cheat sheets: [link1](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf), [link2](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf)
- Data preprocessing
    - Case 1: Data preprocessing [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_1.ipynb)
    - Case 2: Financial time series [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_2_financial_time_series.ipynb)
    - Case 3: JSON files [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_3_json.ipynb)
    - Case 4: Merging DataFrames [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
    - Case 5: Cross tabulation [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_cross_table_example.ipynb)
    - (FYR) String processing
        - Regular expressions
            - Interactive tutorial: https://regexone.com/
            - Python module: https://docs.python.org/3/library/re.html
    - (FYR) Pythonic data cleaning with NumPy and pandas [link](https://realpython.com/python-data-cleaning-numpy-pandas/)
    - (FYR) https://www.kaggle.com/learn/data-cleaning
- Data visualization
    - Matplotlib official documentation [link](https://matplotlib.org/contents.html)
    - (FYR) http://scipy-lectures.org/intro/matplotlib/index.html
    - Cheat sheets by DataCamp: [pdf](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
    - A good tutorial by [Nicolas P. Rougier](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)
    - (FYR) https://www.kaggle.com/learn/data-visualization
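:::success
++Lab 1++ Using Pandas
(1) For each stock, count the number of days in the past year on which the daily change exceeded $\pm 9\%$.
(2) Rank the assets by that count.
(3) Save the ranking as a CSV file.
(4) Plot the time series of the top 3 winners and the top 3 losers; compare the relative performance of these assets (i.e., their profit and loss) by normalizing each series with rebase().
(5) Make a scatter plot to see whether there is any relationship between profit and loss and the count (the number of days on which the daily change exceeded $\pm 9\%$).
(6) Save the scatter plot as a PDF file.
![](https://hackmd.io/_uploads/H1_hljhl9.png =350x)
[Template](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_template.ipynb) / [Reference solution](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_demo.ipynb)
:::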
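As a rough illustration of steps (1), (2), and (4) of Lab 1, the sketch below works on synthetic prices. The tickers `AAA`, `BBB`, `CCC`, the random data, and the manual rebasing (dividing each series by its first value, as a stand-in for `rebase()`) are all assumptions made for this example; the lab itself uses the data and helpers provided in the template notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices for three hypothetical tickers
# (assumption: the real lab uses market data from the template notebook).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-01", periods=250)
returns = rng.normal(loc=0.0, scale=0.04, size=(250, 3))
prices = pd.DataFrame(100 * np.cumprod(1 + returns, axis=0),
                      index=dates, columns=["AAA", "BBB", "CCC"])

# (1) Count the days on which the daily change exceeds +/-9%.
daily_change = prices.pct_change().dropna()
counts = (daily_change.abs() > 0.09).sum()

# (2) Rank the assets by that count, largest first.
ranking = counts.sort_values(ascending=False)

# (4) Rebase every series so that it starts at 1, then read off the
#     profit and loss over the whole period.
rebased = prices / prices.iloc[0]
pnl = rebased.iloc[-1] - 1

print(ranking)
print(pnl)
```

The ranking could then be written out with `ranking.to_csv(...)`, and `rebased` plotted directly with `rebased.plot()`.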
### Probability theory
- Classical probability [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture1_ProbabilityModel_2019.pdf)
    - Key terms: sample space, event, axioms of probability, probability measure, conditional probability, independent events
    - Does probability zero mean the event cannot happen?
        - [Cantor set](https://en.wikipedia.org/wiki/Cantor_set)
        - Prof. 陳俊全 (Department of Mathematics, National Taiwan University) offers an amusing analogy: the Cantor set is like a poor person's pocket. There is always some money to be found in it for a meal or a doctor's visit, yet when you ask for the total net worth, it adds up to zero.
        - Collected quotes of Prof. 陳俊全: https://disp.cc/b/181-1wia
- Random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture2_RandomVariable_2019.pdf) / [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_probability_models_and_random_number_generators.ipynb)
    - Discrete random variables: Bernoulli distribution, binomial distribution
    - Continuous random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture5_Normal_2019.pdf): uniform distribution, normal distribution, $\chi^2$ distribution, Student's t distribution, F distribution
    - Probability models that are already implemented can be found in the SciPy documentation:
        - https://docs.scipy.org/doc/scipy/reference/stats.html
        - https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
    - Family tree of probability distributions [pic](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Relationships_among_some_of_univariate_probability_distributions.jpg/1920px-Relationships_among_some_of_univariate_probability_distributions.jpg)
    - Poisson distribution [pdf](https://sites.pitt.edu/~super7/19011-20001/19501.pdf)
- Random number generation (RNG)
    - Pseudorandomness [link](https://en.wikipedia.org/wiki/Pseudorandomness)
    - [random -- Generate pseudo-random numbers](https://docs.python.org/3/library/random.html)
    - [Inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) for generating sample numbers at random from any probability distribution given its cumulative distribution function.
- Expectation and multivariate random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture4_Moments_2019.pdf)
    - Central tendency: arithmetic mean, geometric mean, median, mode
    - Dispersion: variance, standard deviation, range
    - Higher-order moments: skewness, kurtosis
        - Positive (negative) skewness: the mean is typically higher (lower) than the median
        - Kurtosis > 3: heavier tails than the normal distribution
    - (FYR) Moment generating function (mgf) [pdf](https://web.ma.utexas.edu/users/gordanz/notes/mgf_color.pdf)
        - Taylor expansion [wiki](https://en.wikipedia.org/wiki/Taylor_series)
        - What is a [moment](https://www.merriam-webster.com/dictionary/moment)?
    - Covariance and correlation
        - Does zero correlation imply independence?
    - Conditional expectation and conditional variance
        - Law of total variance [wiki](https://en.wikipedia.org/wiki/Law_of_total_variance)
    - Independent and identically distributed (**iid**, **i**ndependent and **i**dentically **d**istributed)
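
The question "Does zero correlation imply independence?" raised in the list above has a standard counterexample: take $X$ symmetric around zero and $Y = X^2$. Then $Y$ is completely determined by $X$, yet $\mathrm{Cov}(X, Y) = \mathbb{E}(X^3) = 0$. The sketch below checks this numerically; the choice of a uniform distribution on $[-1, 1]$ and the helper names are assumptions made for illustration.

```python
import random

random.seed(0)

# X is symmetric around zero; Y = X**2 is a deterministic function of X,
# so the two variables are clearly dependent.
xs = [random.uniform(-1, 1) for _ in range(100_000)]
ys = [x * x for x in xs]


def mean(values):
    return sum(values) / len(values)


def correlation(a, b):
    """Pearson sample correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = mean([(u - ma) * (v - mb) for u, v in zip(a, b)])
    var_a = mean([(u - ma) ** 2 for u in a])
    var_b = mean([(v - mb) ** 2 for v in b])
    return cov / (var_a * var_b) ** 0.5


r = correlation(xs, ys)
print(f"sample correlation between X and X^2: {r:.4f}")
```

The printed correlation should be close to zero even though knowing $X$ pins down $Y$ exactly, so zero correlation only rules out a linear relationship.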
![](https://i.imgur.com/HyJiGlg.png =500x)
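
To make the inverse transform sampling item above concrete, here is a minimal sketch for the exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts to $F^{-1}(u) = -\ln(1-u)/\lambda$. The function name `sample_exponential` and the rate $\lambda = 2$ are assumptions chosen for the example.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one Exponential(rate) value by inverting its CDF."""
    u = rng.random()                  # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate  # F^{-1}(u); 1 - u lies in (0, 1]


rng = random.Random(42)
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

# The theoretical mean of Exponential(rate) is 1 / rate.
print(f"sample mean = {sample_mean:.4f}, theoretical mean = {1 / rate:.4f}")
```

The same pattern applies to any distribution whose CDF can be inverted in closed form or numerically.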
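:::success
++Lab 2++ What is an expected value?
Suppose the random variable $Y$ follows the distribution below:
![](https://i.imgur.com/rBFWcDO.png =300x)
Then $\mathbb{E}(Y) = 0.9$.
Write a program that simulates samples drawn from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.
![](https://hackmd.io/_uploads/BJgpxCr-9.png =600x)
[Demo code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab2_demo.ipynb)
:::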
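A possible starting point for Lab 2 is sketched below. The distribution table of $Y$ is only available as an image, so the values and probabilities used here are hypothetical; they are chosen solely so that $\mathbb{E}(Y) = 0.9$ and should be replaced with the ones from the lab.

```python
import random

# Hypothetical distribution with E(Y) = 0*0.4 + 1*0.3 + 2*0.3 = 0.9
# (assumption: the actual table is given in the lab's image).
values = [0, 1, 2]
probs = [0.4, 0.3, 0.3]
expected = sum(v * p for v, p in zip(values, probs))

rng = random.Random(1)
draws = rng.choices(values, weights=probs, k=3000)

# running_means[n - 1] is the sample mean of the first n draws.
running_means = []
total = 0
for n, y in enumerate(draws, start=1):
    total += y
    running_means.append(total / n)

for n in (1, 10, 100, 1000, 3000):
    print(f"n = {n:4d}: sample mean = {running_means[n - 1]:.4f}")
print(f"expected value = {expected:.4f}")
```

Plotting `running_means` against the sample size (for example with Matplotlib) gives the kind of convergence picture shown in the lab.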
### Statistical framework
- Sampling methods and sampling distributions [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture6_Sampling_2019.pdf)
    - https://en.wikipedia.org/wiki/Sampling_(statistics)
- Statistical tests [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture8.pdf)
    - Keywords: null/alternative hypothesis, p-value, significance level, rejection region, type I/II/III errors
    - Examples with SciPy [link](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)
    - Further reading [pdf1](https://www2.isye.gatech.edu/~yxie77/isye2028/lecture8.pdf), [pdf2](http://www.sci.utah.edu/~arpaiva/classes/UT_ece3530/hypothesis_testing.pdf)
    - Cases:
        - $\chi^2$ test of independence
            - [Lesson 8: Chi-Square Test for Independence](https://online.stat.psu.edu/stat500/lesson/8), [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_chi_square_test_of_independence.ipynb)
            - [SPSS tutorials: Chi-Square Test of Independence](https://libguides.library.kent.edu/spss/chisquare)
![](https://imgs.xkcd.com/comics/null_hypothesis.png =190x)![](https://i.imgur.com/UOBvF5W.png =200x)
- Linear regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_3_linear_regression.ipynb)
    - Python package **statsmodels** [link](https://www.statsmodels.org/stable/regression.html)
    - Notes:
        - Interpreting Results from Linear Regression – Is the data appropriate? [link](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)
        - About errors and residuals [wiki](https://en.wikipedia.org/wiki/Errors_and_residuals)
        - Normality tests [pdf](http://webspace.ship.edu/pgmarr/Geo441/Lectures/Lec%205%20-%20Normality%20Testing.pdf)
            - (FYR) Seier (2014): [Normality Tests: Power Comparison](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_421)
            - (FYR) Jarque (2014): [Jarque-Bera Test](https://link.springer.com/referenceworkentry/10.1007/978-3-642-04898-2_319) proposed by Jarque and Bera (1980).
            - (FYR) Bowman and Shenton (2014): [Omnibus Test](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_426) proposed by D’Agostino (1973).
        - How to detect multicollinearity?
            - [Variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor) (VIF)
        - How to detect heteroscedasticity?
            - Weighted least squares (WLS), a special case of generalized least squares (GLS)
    - More cases:
        - Buffett's alpha by AQR Capital Management [link](https://www.aqr.com/Insights/Research/Journal-Article/Buffetts-Alpha) [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Buffett_Alpha.pdf) [Explanation of Buffett's alpha on Vocus (方格子)](https://vocus.cc/tarcy2801/5d6fc9fdfd89780001ca4e42)
        - More on regression with categorical data [link](https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/)
        - Data transformations [link](http://www.biostathandbook.com/transformation.html)
            - Log transformation: for size data
            - Square-root transformation: for count data
            - Arcsine transformation
        - [Ramsey RESET test](https://en.wikipedia.org/wiki/Ramsey_RESET_test)
        > If the proposed model is adequate, then the standardized residuals should be a white noise.
        > It tests whether non-linear combinations of the fitted values help explain the response variable.
![](https://imgs.xkcd.com/comics/linear_regression_2x.png =400x)
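
For the $\chi^2$ test of independence listed above, the test statistic can be computed by hand from a contingency table, which may help when reading the output of library routines. The sketch below uses a made-up 2×2 table and no continuity correction (some library routines apply one by default for 2×2 tables, so their numbers can differ); the p-value formula used here is valid only for one degree of freedom.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table of counts (rows: groups, columns: outcomes).
table = [[20, 30],
         [30, 20]]
stat, dof = chi_square_independence(table)

# With 1 degree of freedom, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(f"chi2 = {stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

For tables with more degrees of freedom, the p-value would normally be taken from a chi-square survival function such as the one in SciPy.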
:::success
++Lab 3++ Simple linear regression
TBA
![](https://hackmd.io/_uploads/SJegQSkGq.png =400x)
:::
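Since the Lab 3 statement is still to be announced, the following is only a generic sketch of simple linear regression using the closed-form least-squares formulas $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. The data points and the function name `fit_simple_ols` are made up for illustration and are not part of the lab.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - SS_res / SS_tot
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up data lying roughly on the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_simple_ols(x, y)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.4f}")
```

In practice the **statsmodels** package referenced in this section reports the same estimates together with standard errors and diagnostic tests.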
- Parameter estimation
    - Point estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture8_PointEst_2019.pdf)
        - Methods
            - Analogy method
            - Method of moments
            - Maximum likelihood estimation (MLE)
        - A good estimator has at least three properties:
            - Unbiased
            - Efficient
            - Consistent
        - Best linear unbiased estimator (BLUE)
            - Gauss-Markov theorem
            > ... the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.
        - Sufficient statistics and the uniformly minimum-variance unbiased estimator (UMVUE)
            - See mathematical statistics for details.
    - Interval estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture9_IntervalEst_2019.pdf)
        - What is a 95% confidence interval?
![](https://i.imgur.com/BXhjfPu.png =400x)
- Analysis of variance (ANOVA) [pdf](http://amath2.nchu.edu.tw/honda/605Lecture/Lecture10_ANOVA.pdf) [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_4_ANOVA_example.ipynb)
    - Why not t-test? [link](https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/)
    > Another measure to compare the samples is called a t-test. When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a **confounding effect** on the error rate of the result.
        - Confounding effect: https://www.scribbr.com/methodology/confounding-variables/
    - More cases:
        - One-way ANOVA: https://www.pythonfordatascience.org/anova-python/
        - Two-way ANOVA: http://www.pybloggers.com/2016/03/three-ways-to-do-a-two-way-anova-with-python/
- Design of experiments (DoE)
    - Three basic principles: randomization, replication, blocking
    - Removing the influence of nuisance variables on the response variable:
        - Unknown and uncontrollable: full randomization
        - Known but uncontrollable: analysis of covariance (ANCOVA)
        - Known and controllable: blocking (a local-control method that lowers SSE and increases precision)
            - Latin square design (LSD)
![](https://i.imgur.com/pp5QR5A.png =400x)
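
Returning to the question "What is a 95% confidence interval?" from the interval estimation item above: one way to read it is that, over many repeated samples, about 95% of the intervals constructed this way contain the true parameter. The simulation below is a sketch under stated assumptions: normally distributed data with a known true mean of 10, sample size 50, and a normal-approximation interval (so the actual coverage is slightly below 95%).

```python
import random
from statistics import NormalDist, mean, stdev


def normal_ci(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    centre = mean(sample)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    return centre - half_width, centre + half_width


rng = random.Random(7)
true_mean, true_sd, n, trials = 10.0, 2.0, 50, 2000

covered = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    lower, upper = normal_ci(sample)
    covered += lower <= true_mean <= upper  # True counts as 1

coverage = covered / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains the true mean or it does not; the 95% refers to the long-run behaviour of the procedure.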
- Asymptotic theory [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture7_Asymptotics_2019.pdf)
    - Convergence
    - Law of large numbers (LLN)
    - Central limit theorem (CLT)
    > The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed—as long as your sample size is large enough. See [link](https://statisticsbyjim.com/basics/central-limit-theorem/).
    - Supplementary material:
        - https://python.quantecon.org/lln_clt.html
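:::success
++Lab 4++ Checking the central limit theorem
Write a program that simulates drawing samples of various sizes from each of the distributions below, and find the smallest sample size for which the sampling distribution is not rejected by a normality test:
1. Standard uniform distribution
2. Chi-square distribution ($\chi^2$ distribution with df = 3)
3. Poisson distribution (Poisson distribution with $\mu = 3$)
4. Cauchy distribution (the standard Cauchy distribution)
![](https://i.imgur.com/KlEobFi.png =300x)![](https://i.imgur.com/X6D5TA9.png =300x)
![](https://i.imgur.com/U6B6bFG.png =300x)![](https://i.imgur.com/ShQeTfB.png =300x)
[Demo code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab4_demo.ipynb)
:::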
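One possible way to approach Lab 4 is sketched below for the standard uniform case only. It assumes that "sampling distribution" means the distribution of the sample mean, uses 1000 replications per sample size, and applies a hand-written Jarque-Bera test whose p-value comes from the asymptotic $\chi^2_2$ approximation (survival function $e^{-x/2}$). These choices are assumptions for illustration; the demo notebook may use a different test or setup.

```python
import math
import random


def jarque_bera(data):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)


def draw_uniform(rng):
    return rng.random()


def sample_means(draw, n, replications, rng):
    """Simulate `replications` sample means, each from a sample of size n."""
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(replications)]


rng = random.Random(2024)
p_values = {}
for n in (1, 2, 5, 10, 30):
    means = sample_means(draw_uniform, n, 1000, rng)
    jb, p = jarque_bera(means)
    p_values[n] = p
    print(f"n = {n:2d}: JB = {jb:8.2f}, p-value = {p:.4f}")
```

Swapping `draw_uniform` for a chi-square, Poisson, or Cauchy generator covers the other cases; for the Cauchy distribution the sample mean does not become normal, because the central limit theorem requires a finite variance.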
![](https://i.imgur.com/EZwESD1.png =250x)
- Time series analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_4_time_series.ipynb)
    - Autocorrelation
    - Stationarity and unit root tests
    - Autoregressive (AR) model
    - Moving average (MA) model
    - ARMA$(p,q)$ and ARIMA$(p,d,q)$ models
- Bayesian probability [pdf](http://pillowlab.princeton.edu/teaching/mathtools16/slides/lec13_BayesRule.pdf)
    - Marc Garcia, [Bayesian inference tutorial: a hello world example](https://datapythonista.me/blog/bayesian-inference-tutorial-a-hello-world-example.html), 2020
    - Samuel Hinton, [Bayesian Linear Regression in Python](https://cosmiccoding.com.au/tutorials/bayes_lin_reg), 2019
    - References:
        - (FYR) Learning from experience: an intuitive understanding of Bayes' theorem and its applications [link](https://leemeng.tw/intuitive-understandind-of-bayes-rules-and-learn-from-experience.html)
        - (FYR) Stop guessing and relying on luck! NASA and Microsoft both use Bayesian theory to make decisions [link](https://buzzorange.com/techorange/2019/07/24/nasa-how-to-make-right-decision/)
        - (FYR) Chapter 12: Bayesian Inference, Statistical Machine Learning, CMU [pdf](http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf)
        - [Introduction to Bayesian Modeling with PyMC3](https://juanitorduz.github.io/intro_pymc3/)
        - [Bayes’ Rule With Python](http://jim-stone.staff.shef.ac.uk/BookBayes2012/bookbayesch01WithPython.pdf)
        - [Monty Hall Problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)
        - https://www.astronomy.swin.edu.au/~cblake/StatsLecture4.pdf
        - https://astrostatistics.psu.edu/RLectures/IntroBayes-1.pdf
        - https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_bayesiandecision.pdf
        - [Concepts of Bayesian statistics (貝氏統計學的概念.pdf)](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/貝氏統計學的概念.pdf)
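
The Monty Hall problem in the reference list above is a classic exercise in conditional probability: Bayes' rule gives a winning probability of 2/3 for switching and 1/3 for staying. The simulation below is a small sketch that checks this numerically; the function name `play` and the number of trials are arbitrary choices for the example.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


rng = random.Random(0)
trials = 100_000
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials
print(f"win rate when switching: {switch_rate:.3f}")
print(f"win rate when staying:   {stay_rate:.3f}")
```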
### Introduction to machine learning
- (FYR) DeepMind: A Documentary File [youtube](https://www.youtube.com/watch?v=WXuK6gekU1Y)
- Regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_5_machine_learning_tutorial.ipynb)
    - Ridge regression
    - LASSO regression
    - Logistic regression
- Support vector machine (SVM)
- Decision trees and random forests
- Principal component analysis (PCA)
    - https://setosa.io/ev/principal-component-analysis/
- K-means clustering
- Reinforcement learning: Q-learning
- Deep learning
    - https://www.kaggle.com/learn/intro-to-deep-learning
- Case study: Jacky Hsueh, [Why economic theory is needed to forecast economic trends: comparing machine learning and econometrics](http://economicsnote.com/%E7%82%BA%E4%BB%80%E9%BA%BC%E9%9C%80%E8%A6%81%E7%B6%93%E6%BF%9F%E7%90%86%E8%AB%96%E4%BE%86%E9%A0%90%E6%B8%AC%E7%B6%93%E6%BF%9F%E8%B6%A8%E5%8B%A2%E6%AF%94%E8%BC%83%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/), 2021.2.26
- [Vapnik–Chervonenkis dimension](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension)
- [Receiver operating characteristic (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
![](https://i.imgur.com/PY21A4V.png =350x)
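:::success
++Lab 5++ Implementing the K-means algorithm
Write a program that implements the K-means algorithm.
The [template code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab5_template.ipynb) already generates the test data, shown in the left panel below.
The basic idea of K-means is to group points according to their **distance** from one another.
For the steps of the algorithm, see this [link](https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm_(naive_k-means)).
The clustering result is shown in the right panel below, where the red diamonds mark the arithmetic center (centroid) of each cluster.
The goal is for participants to implement the basic K-means algorithm.
Note that the clustering result (right panel) is not guaranteed to match the true labels (left panel), so the focus of this exercise is the implementation of the algorithm.
![](https://i.imgur.com/329c7JV.png =700x)
:::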
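The standard (naive) K-means iteration referenced in Lab 5 alternates an assignment step and an update step until the centroids stop moving. The sketch below is one minimal pure-Python version. It is not the template's code: the function signature, the explicit `initial_centroids` argument, and the toy data are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Naive K-means; returns (centroids, clusters) from the last iteration."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two well-separated squares of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
```

In a full solution the initial centroids are usually drawn at random from the data, which is one reason the result can differ between runs and from the true labels.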
### Statistics in practice
- Nonparametric analysis
    - Rank correlation
        - Spearman's rank correlation coefficient
        - Kendall's rank correlation coefficient
    - One population
        - Sign test
        - Wilcoxon signed-rank test
    - Two dependent populations
        - Paired sign test
        - Wilcoxon paired signed-rank test
    - Two independent populations
        - Wilcoxon rank-sum test
        - Mann-Whitney U test
    - Several independent populations
        - Kruskal-Wallis test
    - Several dependent populations
        - Friedman test
    - Tests of randomness
        - Runs test
    - Kernel density estimation (KDE) [link](https://scikit-learn.org/stable/modules/density.html)
![](https://i.imgur.com/ATECGO7.png =400x)
- Small-sample analysis
    - Fisher's exact test [link](http://blog.pulipuli.info/2017/05/fishers-exact-test-example.html)
![](https://i.imgur.com/JyiaDyG.jpg =250x)
- Bimodal/multimodal distributions
    - https://en.wikipedia.org/wiki/Mixture_model
- Extreme value theory (EVT)
    - https://en.wikipedia.org/wiki/Heavy-tailed_distribution
    - https://en.wikipedia.org/wiki/Extreme_value_theory
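
Among the nonparametric tests listed in this section, the sign test is simple enough to compute directly: under the null hypothesis of a zero median, the number of positive differences follows a Binomial$(n, 1/2)$ distribution. The sketch below computes an exact two-sided p-value; the paired differences and the function name `sign_test` are made up for illustration.

```python
from math import comb


def sign_test(differences):
    """Exact two-sided sign test; zero differences are discarded."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(d > 0 for d in nonzero)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, n, min(1.0, 2 * tail)


# Hypothetical paired differences (after minus before).
diffs = [1.2, 0.8, -0.3, 2.1, 0.5, 1.7, -0.9, 0.4, 1.1, 0.6]
positives, n, p_value = sign_test(diffs)
print(f"{positives} positive out of {n}, two-sided p-value = {p_value:.4f}")
```

Because it uses only the signs of the differences, the sign test makes very few assumptions but also has less power than the Wilcoxon signed-rank test when the magnitudes carry information.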
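## Candidate topics
- Chinese culture has a tradition of eating nourishing tonic foods in winter. If this winter is unusually cold, does that push up the prices of food-sector stocks?
- Markov chain Monte Carlo (MCMC)
- When day trading accounts for a growing share of the day's trading value, does it signal that the market is about to turn bearish?
- Epidemiological models https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
- Is there really a power shortage, or is it overconsumption?
    - What share of electricity consumption goes to cryptocurrency mining?
- A time-series version of $R^2$.
    - $R^2$ follows a beta distribution; after taking the $n$-th root it follows a normal distribution.
- Brownian motion
    - Brownian bridge
    - Arcsine law: https://scipython.com/blog/the-arcsine-law/
- War in Ukraine: https://ourworldindata.org/ukraine-war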
## Data sources
### Taiwan government open data
- Government open data platform: https://data.gov.tw/
- Taipei City data platform (臺北市資料大平臺): https://data.taipei/
- Central Weather Bureau open data: https://opendata.cwb.gov.tw/dataset/observation
- Earnings explorer (薪情體驗): https://earnings.dgbas.gov.tw/experience_sub_01.aspx
    - https://www.numbeo.com/cost-of-living/
- National Development Council population projection query system: https://pop-proj.ndc.gov.tw/index.aspx
- Department of Statistics, Ministry of the Interior: https://www.moi.gov.tw/stat/index.aspx
- Ministry of the Interior real estate actual-price lookup: https://lvr.land.moi.gov.tw/homePage.action
    - "Is it feasible to analyze real estate with code? Housing price analysis here!" by [FinLab](https://www.finlab.tw/real-estate-analasys-histograms/)
- Ministry of Culture open data service https://opendata.culture.tw
- Taiwan Power Company https://www.taipower.com.tw/tc/index.aspx
    - Government open data platform dataset list: Taiwan Power Company [link](https://sheethub.com/data.gov.tw/%E6%94%BF%E5%BA%9C%E8%B3%87%E6%96%99%E9%96%8B%E6%94%BE%E5%B9%B3%E8%87%BA%E8%B3%87%E6%96%99%E9%9B%86%E6%B8%85%E5%96%AE/i/44/%E5%8F%B0%E7%81%A3%E9%9B%BB%E5%8A%9B%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8)
- Lottery
    - 超讚的樂透網: https://zan01.com/
    - 樂透堂: http://www.9800.com.tw/
### International open data sources
- U.S. Census Bureau: https://www.census.gov/
- World Bank: https://data.worldbank.org/
- NASA: https://nasa.github.io/data-nasa-gov-frontpage/data_visualizations.html
- Data World: https://data.world/
- Human Development Reports: http://www.hdr.undp.org/en
- Sports Reference: https://www.baseball-reference.com/
- Databank of Bank of England: https://www.bankofengland.co.uk/statistics
### Analysis platforms
- Google Data Studio: https://datastudio.google.com/
### Competition platforms
- https://www.kaggle.com/datasets/
- https://zindi.africa/competitions
## References
### Books
#### Textbooks used previously
- Thomas Haslwanter, [An Introduction to Statistics with Python](https://www.springer.com/us/book/9783319283159), 2016. Downloadable from within the NTU campus IP range.
![](https://i.imgur.com/ypOd5fQ.png =100x)
- José Unpingco, [Python for Probability, Statistics, and Machine Learning](https://link.springer.com/book/10.1007%2F978-3-319-30717-6), 2/e
[link](https://link.springer.com/book/10.1007%2F978-3-030-18545-9). Downloadable from within the NTU campus IP range.
![](https://i.imgur.com/vdfeIre.png =100x)
- Jake VanderPlas, [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/), 2016 [online](https://github.com/jakevdp/PythonDataScienceHandbook) [github](https://github.com/jakevdp/PythonDataScienceHandbook)
![](https://i.imgur.com/Bdf0PTa.png =100x)
#### Probability theory
- Sheldon Ross, [Introduction to Probability Models](https://www.elsevier.com/books/introduction-to-probability-models/ross/978-0-12-814346-9), 12/e, 2019
![](https://i.imgur.com/57W4fGF.jpg =100x)
#### Mathematical statistics
- Robert V. Hogg, Joseph W. McKean, and Allen T. Craig, [Introduction to Mathematical Statistics](https://www.amazon.com/-/zh_TW/dp/0134686993/), 8/e, 2019
![](https://i.imgur.com/2juN4uq.png =100x)
- George Casella and Roger L. Berger, [Statistical Inference](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), 2/e, 2001
![](https://i.imgur.com/D9fxPRb.png =100x)
#### Design of experiments
- Douglas C. Montgomery, [Design and Analysis of Experiments](https://www.wiley.com/en-gb/Design+and+Analysis+of+Experiments,+9th+Edition-p-9781119320937), 9/e, 2017
![](https://i.imgur.com/KH0xEM8.jpg =100x)
- Angela Dean, Daniel Voss, and Danel Draguljić, [Design and Analysis of Experiments](https://link.springer.com/book/10.1007/978-3-319-52250-0), 2017
![](https://i.imgur.com/5JKXVjP.jpg =100x)
#### General statistics
- Barbara Blatchley, [Statistics in Context](https://www.amazon.com/Statistics-Context-Barbara-Blatchley/dp/0190278951), 2018
![](https://i.imgur.com/MFDwoYZ.png =100x)
- 張翔 and 廖崇智, ++提綱挈領學統計++, 8th edition, 2019/6/14
![](https://i.imgur.com/DyXbeFs.png =130x)
- 許誠哲, ++統計學:重點觀念與題解++, 2018/3/1
![](https://i.imgur.com/BdkBLQ0.png =90x)![](https://i.imgur.com/2w0llwW.png =90x)
- David M. Lane et al., ++Online Statistics Education++: http://onlinestatbook.com/Online_Statistics_Education.pdf
#### Time series
- 陳旭昇, [時間序列分析-總體經濟與財務金融之應用](http://homepage.ntu.edu.tw/~sschen/Book/Book2.htm), 2nd edition
![](https://i.imgur.com/b2Ubehb.png =100x)
- ++Introduction to Econometrics with R++: https://www.econometrics-with-r.org/index.html
#### Machine learning
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [An Introduction to Statistical Learning with Applications in R](https://link.springer.com/book/10.1007/978-1-4614-7138-7), 2013
![](https://i.imgur.com/sCY7WYs.png =100x)
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://link.springer.com/book/10.1007/978-0-387-84858-7), 2009
![](https://i.imgur.com/u1Mi1mt.png =100x)
- Ovidiu Calin, [Deep Learning Architectures](https://link.springer.com/book/10.1007/978-3-030-36721-3), 2020
![](https://hackmd.io/_uploads/rkX2aVN6q.png =100x)
### Popular science reading
- History of Statistics: https://www.york.ac.uk/depts/maths/histstat/
![](https://i.imgur.com/UzJP2o3.png =100x)
- 安德魯·維克斯 (Andrew Vickers), ++34個讓你豁然開朗的統計學小故事++, 2019/03/28
![](https://i.imgur.com/QYCD6Fo.png =140x)
- 羅伯特·艾貝爾森 (Robert Abelson), ++一位耶魯大學教授的統計箴言++, 2019/05/28
![](https://i.imgur.com/okSuBrz.png =100x)
- 塚本邦尊 et al., ++東京大學資料科學家養成全書:使用Python動手學習資料分++
![](https://i.imgur.com/FebESuA.png =150x)
### Courses in Taiwan and abroad
- Prof. Shiu-Sheng Chen (http://homepage.ntu.edu.tw/~sschen/)
- Dr. Shao-Wei Cheng (http://www.stat.nthu.edu.tw/~swcheng/)
    - ++Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2820/index.php
    - ++Probability++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2810/index.php
    - ++Experimental Design and Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5510/index.php
    - ++Mathematical Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat3875/index.php
    - ++Discrete Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/index.html
- ++STA-663-2017++: http://people.duke.edu/~ccc14/sta-663-2017/
- Dr. Peter Kempthorne, ++Lecture notes on Probability and Statistics++: http://users.encs.concordia.ca/~doedel/courses/comp-233/slides.pdf
- ++Mathematical Statistics++: https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/
- Emmanuel Candès, https://statweb.stanford.edu/~candes/
    - ++Theory of Statistics++: https://statweb.stanford.edu/~candes/teaching/stats300c/
    - ++Modern Markov Chain++: https://statweb.stanford.edu/~candes/teaching/stats318/
- Oleg Melnikov, ++Introduction to Statistical Inference++: http://stats200.stanford.edu/
- Liam Paninski, ++Computational Statistics++: http://www.stat.columbia.edu/~liam/teaching/compstat-spr19/
- David Aldous, ++Probability and the Real World++: https://www.stat.berkeley.edu/~aldous/157
- Bocheng Jing, ++Sta 102 - Intro Biostatistics++: https://www2.stat.duke.edu/courses/Spring13/sta102.001/
### Miscellaneous
- Secretary problem
    - https://zh.wikipedia.org/wiki/%E7%A7%98%E6%9B%B8%E5%95%8F%E9%A1%8C
    - https://style.udn.com/style/story/8073/1452739
- http://www.statslife.org.uk/images/pdf/timeline-of-statistics.pdf
- 王超辰: Medical Statistics (醫學統計學) https://bookdown.org/ccwang/medical_statistics6/
- Ioane Muni Toke, ++An Introduction to Hawkes Processes with Applications to Finance++, 2011: http://lamp.ecp.fr/MAS/fiQuant/ioane_files/HawkesCourseSlides.pdf
- https://ourworldindata.org/grapher/annual-working-hours-vs-gdp-per-capita-pwt
- https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf