
---
tags: statistics
---

# Introduction to Data Science (資料科學入門)

Original course title: Learning Statistics & Programming (當統計學與程式相遇)

![](https://i2.wp.com/www.prosancons.com/wp-content/uploads/2018/04/041518_1320_Prosandcons1.png =400x)

"The purpose of computing is insight, not numbers."
-- Richard W. Hamming

"Data do not speak for themselves;
there is always an interpreter, or a translator."
-- John W. Ratcliffe

"Remember that all models are wrong;
the practical question is how wrong do they have to be to not be useful."
-- George Box

"It is easy to lie with statistics, but easier to lie without them."
-- Frederick Mosteller

"Science is more than a body of knowledge;
it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem,
is far more important than the findings of science."
-- Carl Sagan

### Instructor

- 盧政良 (Zheng-Liang Lu, Arthur)
- Contact: [email protected]

### Working Environment

- Google Colab https://colab.research.google.com/

:::warning
This course does not restrict the programming language, but demonstrations use [Python](https://hackmd.io/@arthurzllu/HJNXq84SO). Participants may work through the material in Excel, R, or MATLAB, but must find the corresponding tools for each problem on their own.

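:::

### Prerequisites

- Arithmetic and basic algebra
- Life experience and civic ethics
- (Optional) Calculus
    - Prof. 朱樺, Department of Mathematics, National Taiwan University, [Calculus](http://www.math.ntu.edu.tw/~hchu/Calculus/)
    - 3Blue1Brown, [Essence of Calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr)
    ![](https://i.imgur.com/1ZDNtsh.png =200x)
- (Optional) Linear algebra
    - Stephen H. Friedberg, Arnold J. Insel, Lawrence E. Spence, [Linear Algebra](https://www.amazon.com/Linear-Algebra-5th-Stephen-Friedberg/dp/0134860241/), 5/e, 2018
    ![](https://i.imgur.com/Si0JvQb.png =100x)
    - 3Blue1Brown, [Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) on YouTube
    ![](https://i.imgur.com/tnuPzSf.png =200x)

:::warning
For now, you only need to understand the context and the results of the mathematics used in this course. Whether to master the details of derivations and calculations is up to your own interest and ability. Python packages already implement most of the mathematical results, so tedious calculations can be left to the computer.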

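:::

### Learning Objectives

* Statistics
    - Understand statistical tools and the principles behind their computation
    - Interpret statistical results correctly
    - Make reasonable predictions of trends in data
    - Rule out statistical fallacies
* Programming
    - Master the data processing workflow
    - Learn to build your own tools

![](https://i.imgur.com/nVXc2TN.png =250x)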
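### Grading

#### In-person course

- Final project presentation
    - Key components: the question posed, data collection and visualization, model assumptions, experimental results, and conclusions.
    - Grouping: one person per group; present with **slides** or a **jupyter notebook**.
- Participants who complete the five programming assignments and the final project report receive the course certificate.
    - Send assignments by email to [email protected], stating the course name and your name.

#### Online course

- Participants who complete the five programming assignments receive the course certificate.
    - Submit each assignment by uploading the **jupyter notebook** to [NTU COOL](https://cool.ntu.edu.tw/).

### Intended Audience

1. College and university students, researchers, and engineers who want to learn **statistical methods** and **quantitative research**.
2. Junior and senior high school students are welcome; prior study of basic statistics is preferred (in the 108 curriculum guidelines: Probability and Statistics I in grade 11 and Probability and Statistics II in grade 12).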

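## Main References

- Steven S Skiena, [The Data Science Design Manual](https://www.springer.com/gp/book/9783319554433), 2017
![](https://i.imgur.com/JrFu6Hh.png =100x)
- Laura Igual and Santi Seguí, [Introduction to Data Science](https://link.springer.com/book/10.1007/978-3-319-50017-1), 2017
![](https://hackmd.io/_uploads/Sk0GAENaq.png =100x)
- 陳旭昇, [統計學:應用與進階](http://homepage.ntu.edu.tw/~sschen/Book/Book1.html) (Statistics: Applications and Advanced Topics), 3rd ed., 2015
![](https://hackmd.io/_uploads/Sym7Jc2eq.png =100x)

## Course Outline

0. Python programming basics
1. Data acquisition and visualization
2. Introduction to probability theory and common probability models
3. Statistical tests
4. Point estimation and interval estimation
5. Law of large numbers and central limit theorem
6. Regression models
7. Time series analysis
8. Bayesian probability
9. Introduction to machine learning
10. (Optional) Statistics in practice

![](https://i.imgur.com/fj7sI9u.png =400x)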
## Course Content

### Python Crash Course

- Python crash course [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_0_python_programming.ipynb)
    - Data types and basic operations
    - Conditional statements
    - Loops
    - Functions
- Supplementary material
    - Additional information related to programming skills [pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/cs_preliminary_knowledge.pdf)
    - Apps for learning to program on your own
        - [Learn Python](https://play.google.com/store/apps/details?id=com.sololearn.python&hl=zh_TW&gl=US)

![](https://i.imgur.com/sMkTwsp.png =300x)
### Data Acquisition and Visualization

- Pandas [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_data_acquisition_visualization.ipynb)
    - Python data analysis library [link](https://pandas.pydata.org/)
    - (FYR) https://www.kaggle.com/learn/pandas
    - (FYR) Cheat sheets: [link1](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf), [link2](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf)
- Data preprocessing
    - Case 1: data preprocessing [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_1.ipynb)
    - Case 2: financial time series [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_2_financial_time_series.ipynb)
    - Case 3: JSON files [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_3_json.ipynb)
    - Case 4: merging DataFrames [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
    - Case 5: cross tabulation [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_cross_table_example.ipynb)
    - (FYR) String processing
        - Regular expressions
            - Interactive tutorial site https://regexone.com/
            - Python package: https://docs.python.org/3/library/re.html
    - (FYR) Pythonic data cleaning with numpy and pandas [link](https://realpython.com/python-data-cleaning-numpy-pandas/)
    - (FYR) https://www.kaggle.com/learn/data-cleaning
- Data visualization
    - Matplotlib official documentation [link](https://matplotlib.org/contents.html)
    - (FYR) http://scipy-lectures.org/intro/matplotlib/index.html
    - Cheat sheets by DataCamp: [pdf](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
    - A nice tutorial by [Nicolas P. Rougier](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)
    - (FYR) https://www.kaggle.com/learn/data-visualization

:::success
++Lab 1++ Using Pandas

(1) For each stock, count the number of days in the past year on which the daily change exceeded $\pm 9\%$.

(2) Rank the assets by that count.

(3) Save the ranking as a csv file.

(4) Plot the time series of the top 3 winners and the top 3 losers; normalize each series with rebase() to compare the assets' relative performance (that is, compute the profit and loss).

(5) Draw a scatter plot to see whether there is any relationship between the profit and loss and the count (the number of days with a daily change beyond $\pm 9\%$).

(6) Save the scatter plot as a pdf file.

![](https://hackmd.io/_uploads/H1_hljhl9.png =350x)
[Template](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_template.ipynb) / [Reference solution](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_demo.ipynb)
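
As a starting point for steps (1), (2), and (4), here is a minimal sketch. It is not the reference solution: the price data are synthetic, the tickers `AAA`, `BBB`, `CCC` are made up, and `rebase` is written by hand as "divide by the first price and scale to 100", which is only an assumption about what the lab's rebase() does.

```python
import numpy as np
import pandas as pd


def count_large_moves(prices, threshold=0.09):
    """Count, per column, the days whose daily return exceeds +/- threshold."""
    returns = prices.pct_change().dropna()
    counts = (returns.abs() > threshold).sum()
    return counts.sort_values(ascending=False)  # ranking, largest count first


def rebase(prices, base=100.0):
    """Rescale every series so that it starts at `base` (hand-written helper)."""
    return prices / prices.iloc[0] * base


# Hypothetical data: three synthetic price series with different volatilities.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-01", periods=250)
vols = {"AAA": 0.01, "BBB": 0.03, "CCC": 0.06}
prices = pd.DataFrame(
    {name: 100 * np.exp(np.cumsum(rng.normal(0, vol, len(dates))))
     for name, vol in vols.items()},
    index=dates,
)

counts = count_large_moves(prices)
pnl = rebase(prices).iloc[-1] - 100.0  # profit and loss relative to the start
summary = pd.DataFrame({"count": counts, "pnl": pnl})
print(summary)
```

Steps (3) and (6) would then be `summary.to_csv(...)` and saving a Matplotlib figure with `savefig(...)`; they are left out here so the sketch writes no files.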
:::

### Probability Theory

- Classical probability [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture1_ProbabilityModel_2019.pdf)
    - Some important terms: sample space, event, probability axioms, probability measure, conditional probability, independent events
    - Does probability zero mean that an event cannot happen?
        - [Cantor set](https://en.wikipedia.org/wiki/Cantor_set)
        - Prof. 陳俊全 of the NTU Department of Mathematics offers an amusing analogy: the Cantor set is like a poor person's pocket. When it is time to eat or see a doctor, the pocket can always produce a little money; but ask for the total net worth, and the sum is zero. (The Cantor set is uncountable, yet its Lebesgue measure is zero.)
        - Collected sayings of Prof. 陳俊全: https://disp.cc/b/181-1wia
- Random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture2_RandomVariable_2019.pdf) / [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_probability_models_and_random_number_generators.ipynb)
    - Discrete random variables: Bernoulli distribution, binomial distribution
    - Continuous random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture5_Normal_2019.pdf): uniform distribution, normal distribution, $\chi^2$ distribution, Student's t distribution, F distribution
    - Probability models already implemented in SciPy are listed in its documentation:
        - https://docs.scipy.org/doc/scipy/reference/stats.html
        - https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
    - Family tree of probability distributions [pic](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Relationships_among_some_of_univariate_probability_distributions.jpg/1920px-Relationships_among_some_of_univariate_probability_distributions.jpg)
    - Poisson distribution [pdf](https://sites.pitt.edu/~super7/19011-20001/19501.pdf)
    - Random number generation (RNG)
        - Pseudorandomness [link](https://en.wikipedia.org/wiki/Pseudorandomness)
        - [random -- Generate pseudo-random numbers](https://docs.python.org/3/library/random.html)
        - [Inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) for generating sample numbers at random from any probability distribution given its cumulative distribution function.
- Expectation and multivariate random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture4_Moments_2019.pdf)
    - Central tendency: arithmetic mean, geometric mean, median, mode
    - Dispersion: variance, standard deviation, range
    - Higher-order moments: skewness, kurtosis
        - Positive (negative) skewness: the mean is typically above (below) the median
        - Kurtosis > 3: heavier tails than the normal distribution
    - (FYR) Moment generating function (mgf) [pdf](https://web.ma.utexas.edu/users/gordanz/notes/mgf_color.pdf)
        - Taylor expansion [wiki](https://en.wikipedia.org/wiki/Taylor_series)
        - What is a [Moment](https://www.merriam-webster.com/dictionary/moment)?
    - Covariance and correlation
        - Does zero correlation imply independence?
    - Conditional expectation and conditional variance
        - Law of Total Variance [wiki](https://en.wikipedia.org/wiki/Law_of_total_variance)
    - Independent and identically distributed (**iid**, **i**ndependent and **i**dentically **d**istributed)

![](https://i.imgur.com/HyJiGlg.png =500x)
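
Inverse transform sampling, listed under random number generation above, can be illustrated with the exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F^{-1}(u) = -\ln(1-u)/\lambda$. The sketch below is only an illustration; the function name `sample_exponential` and the choice of rate are assumptions made for this example.

```python
import numpy as np


def sample_exponential(rate, size, rng):
    """Draw Exponential(rate) samples by inverting the CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random(size)          # U ~ Uniform[0, 1)
    return -np.log1p(-u) / rate   # F^{-1}(u) = -ln(1 - u) / rate


rng = np.random.default_rng(42)
x = sample_exponential(rate=2.0, size=100_000, rng=rng)
print(x.mean())  # should be close to the theoretical mean 1 / rate = 0.5
```

The same recipe works for any distribution whose CDF can be inverted, numerically if necessary.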
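:::success
++Lab 2++ What is the expected value?

Suppose the random variable $Y$ follows the distribution below:

![](https://i.imgur.com/rBFWcDO.png =300x)

Then $\mathbb{E}(Y) = 0.9$.

Write a program that simulates samples drawn from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.

![](https://hackmd.io/_uploads/BJgpxCr-9.png =600x)
[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab2_demo.ipynb)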
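
The idea can be sketched as follows. The distribution used here is hypothetical: the values and probabilities are NOT taken from the figure above and were chosen only so that the expected value is also 0.9. Replace them with the lab's distribution when working on the exercise.

```python
import numpy as np

# Hypothetical distribution (not the one in the figure), chosen so that E(Y) = 0.9.
values = np.array([0, 1, 2])
probs = np.array([0.4, 0.3, 0.3])
expected = float(values @ probs)  # 0*0.4 + 1*0.3 + 2*0.3

rng = np.random.default_rng(0)
n_max = 3000
draws = rng.choice(values, size=n_max, p=probs)

# running_mean[n-1] is the sample mean of the first n draws, for n = 1..3000.
running_mean = np.cumsum(draws) / np.arange(1, n_max + 1)

print(expected, running_mean[-1])
```

Plotting `running_mean` against the sample size, with a horizontal line at `expected`, gives a picture of the same kind as the one shown in the lab.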
:::

### Statistical Framework

- Sampling methods and sampling distributions [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture6_Sampling_2019.pdf)
    - https://en.wikipedia.org/wiki/Sampling_(statistics)
- Statistical tests [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture8.pdf)
    - Keywords: null/alternative hypothesis, p-value, significance level, rejection region, type I/II/III errors
    - Examples using SciPy [link](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)
    - Further reading: [pdf1](https://www2.isye.gatech.edu/~yxie77/isye2028/lecture8.pdf), [pdf2](http://www.sci.utah.edu/~arpaiva/classes/UT_ece3530/hypothesis_testing.pdf)
    - Cases:
        - Test of independence ($\chi^2$ independence test)
            - [Lesson 8: Chi-Square Test for Independence](https://online.stat.psu.edu/stat500/lesson/8), [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_chi_square_test_of_independence.ipynb)
            - [SPSS tutorials: Chi-Square Test of Independence](https://libguides.library.kent.edu/spss/chisquare)

![](https://imgs.xkcd.com/comics/null_hypothesis.png =190x)![](https://i.imgur.com/UOBvF5W.png =200x)
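
To make the $\chi^2$ independence test listed above concrete, here is a small sketch using `scipy.stats.chi2_contingency`. The contingency table is hypothetical and is not taken from the linked course notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: rows are two groups, columns are three categories.
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi2 =", chi2)
print("p-value =", p_value)
print("degrees of freedom =", dof)  # (rows - 1) * (columns - 1)
print("expected counts under independence:\n", expected)
```

A small p-value leads to rejecting the null hypothesis that the row and column variables are independent.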
- Linear regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_3_linear_regression.ipynb)
    - Python package **statsmodels** [link](https://www.statsmodels.org/stable/regression.html)
    - Additional notes:
        - Interpreting Results from Linear Regression – Is the data appropriate? [link](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)
        - About errors and residuals [wiki](https://en.wikipedia.org/wiki/Errors_and_residuals)
        - Normality tests [pdf](http://webspace.ship.edu/pgmarr/Geo441/Lectures/Lec%205%20-%20Normality%20Testing.pdf)
            - (FYR) Seier (2014): [Normality Tests: Power Comparison](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_421)
            - (FYR) Jarque (2014): [Jarque-Bera Test](https://link.springer.com/referenceworkentry/10.1007/978-3-642-04898-2_319) proposed by Jarque and Bera (1980).
            - (FYR) Bowman and Shenton (2014): [Omnibus Test](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_426) proposed by D'Agostino (1973).
        - How to detect multicollinearity?
            - [Variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor) (VIF)
        - How to detect heteroscedasticity?
            - Weighted least squares (WLS), a special case of generalized least squares (GLS), is a common remedy
    - More cases:
        - Buffett's alpha by AQR Capital Management [link](https://www.aqr.com/Insights/Research/Journal-Article/Buffetts-Alpha) [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Buffett_Alpha.pdf) [Explanation of Buffett's alpha on vocus](https://vocus.cc/tarcy2801/5d6fc9fdfd89780001ca4e42)
        - More on regression with categorical data [link](https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/)
        - Data transformations [link](http://www.biostathandbook.com/transformation.html)
            - Log transformation: for size data
            - Square-root transformation: for count data
            - Arcsine transformation
        - [Ramsey RESET test](https://en.wikipedia.org/wiki/Ramsey_RESET_test)
            > If the proposed model is adequate, then the standardized residuals should be a white noise.
            > It tests whether non-linear combinations of the fitted values help explain the response variable.

![](https://imgs.xkcd.com/comics/linear_regression_2x.png =400x)
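
The variance inflation factor mentioned above is $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing the $j$-th regressor on all the others. The sketch below computes it with NumPy only, on synthetic data; the helper name `vif` is an assumption of this example (statsmodels ships its own implementation, which is not used here).

```python
import numpy as np


def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of the design matrix X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic regressors: x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
factors = vif(np.column_stack([x1, x2, x3]))
print(factors)
```

A common rule of thumb treats a VIF above 10 as a sign of problematic multicollinearity; here `x1` and `x3` should stand out while `x2` stays near 1.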
:::success
++Lab 3++ Simple linear regression

TBA

![](https://hackmd.io/_uploads/SJegQSkGq.png =400x)
:::

- Parameter estimation
    - Point estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture8_PointEst_2019.pdf)
        - Methods
            - Analogy principle
            - Method of moments
            - Maximum likelihood estimation (MLE)
        - A good estimator has at least three properties:
            - Unbiased
            - Efficient
            - Consistent
        - Best linear unbiased estimator (BLUE)
            - Gauss-Markov Theorem
                > ... the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.
        - Sufficient statistic and uniformly minimum-variance unbiased estimator (UMVUE)
            - See a mathematical statistics text for details.
    - Interval estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture9_IntervalEst_2019.pdf)
        - What is a 95% confidence interval?

![](https://i.imgur.com/BXhjfPu.png =400x)
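
One way to answer the question "what is a 95% confidence interval?" is by simulation: if the experiment is repeated many times, about 95% of the computed intervals should cover the true parameter. The sketch below assumes normally distributed data with a known true mean (chosen arbitrarily for the example) and uses the usual t-interval for the mean.

```python
import numpy as np
from scipy import stats


def t_interval(sample, level=0.95):
    """Two-sided t confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return mean - t_crit * se, mean + t_crit * se


rng = np.random.default_rng(7)
true_mean, n_experiments, n = 10.0, 2000, 25  # arbitrary illustrative settings

hits = 0
for _ in range(n_experiments):
    lo, hi = t_interval(rng.normal(true_mean, 3.0, size=n))
    hits += int(lo <= true_mean <= hi)

coverage = hits / n_experiments
print(coverage)  # fraction of intervals that contain the true mean
```

The 95% refers to the long-run behavior of the procedure, not to the probability that one particular computed interval contains the parameter.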
- Analysis of variance (ANOVA) [pdf](http://amath2.nchu.edu.tw/honda/605Lecture/Lecture10_ANOVA.pdf) [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_4_ANOVA_example.ipynb)
    - Why not t-test? [link](https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/)
        > Another measure to compare the samples is called a t-test. When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a **confounding effect** on the error rate of the result.
    - Confounding effect: https://www.scribbr.com/methodology/confounding-variables/
    - More cases:
        - One-way ANOVA: https://www.pythonfordatascience.org/anova-python/
        - Two-way ANOVA: http://www.pybloggers.com/2016/03/three-ways-to-do-a-two-way-anova-with-python/
    - Design of experiments (DoE)
        - Three basic principles: randomization, replication, blocking
        - Removing the influence of nuisance variables on the response variable:
            - Unknown and uncontrollable: full randomization
            - Known but uncontrollable: analysis of covariance (ANCOVA)
            - Known and controllable: blocking (a method of local control that lowers SSE and increases precision)
        - Latin square design, LSD

![](https://i.imgur.com/pp5QR5A.png =400x)
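
The two claims in the quotation above can be checked numerically. With two groups, one-way ANOVA and the pooled-variance t-test agree ($F = t^2$, same p-value). With more groups, running every pairwise t-test inflates what is usually called the family-wise Type I error rate. The data below are synthetic, and the inflation formula assumes independent tests, which is a simplification.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=30)  # synthetic group A
b = rng.normal(0.5, 1.0, size=30)  # synthetic group B

t_stat, t_p = stats.ttest_ind(a, b)  # pooled-variance two-sample t-test
f_stat, f_p = stats.f_oneway(a, b)   # one-way ANOVA with two groups

print(t_stat**2, f_stat)  # the two statistics agree: F = t^2
print(t_p, f_p)           # and so do the p-values

# Three groups need m = 3 pairwise t-tests; if each uses alpha = 0.05 and the
# tests were independent, the chance of at least one false rejection would be:
alpha, m = 0.05, 3
familywise = 1 - (1 - alpha) ** m
print(familywise)
```

ANOVA avoids this inflation by testing all group means with a single F-test.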
- Asymptotic theory [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture7_Asymptotics_2019.pdf)
    - Convergence
    - Law of Large Numbers (LLN)
    - Central Limit Theorem (CLT)
        > The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed—as long as your sample size is large enough. See [link](https://statisticsbyjim.com/basics/central-limit-theorem/).
    - Supplementary material:
        - https://python.quantecon.org/lln_clt.html

:::success
++Lab 4++ Checking the central limit theorem

Write a program that draws samples of different sizes from each of the distributions below, and find the smallest sample size for which the sampling distribution of the sample mean is not rejected by a normality test:

1. Standard uniform distribution
2. Chi-square distribution ($\chi^2$ distribution with df = 3)
3. Poisson distribution (Poisson distribution with $\mu = 3$)
4. Cauchy distribution (the standard Cauchy distribution)

![](https://i.imgur.com/KlEobFi.png =300x)![](https://i.imgur.com/X6D5TA9.png =300x)
![](https://i.imgur.com/U6B6bFG.png =300x)![](https://i.imgur.com/ShQeTfB.png =300x)
[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab4_demo.ipynb)
:::
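
One possible skeleton for the experiment in Lab 4 is sketched below; it is not the linked demo code. It assumes the D'Agostino-Pearson test (`scipy.stats.normaltest`) as the normality test and 1000 replications per sample size, both of which are choices of this example. The helper `normality_pvalue` and the `samplers` dictionary are hypothetical names.

```python
import numpy as np
from scipy import stats


def normality_pvalue(sampler, n, n_means=1000, seed=0):
    """p-value of the normality test applied to n_means sample means, each of size n."""
    rng = np.random.default_rng(seed)
    means = sampler(rng, (n_means, n)).mean(axis=1)
    return stats.normaltest(means).pvalue


samplers = {
    "uniform": lambda rng, size: rng.random(size),
    "chi2(df=3)": lambda rng, size: rng.chisquare(3, size),
    "poisson(mu=3)": lambda rng, size: rng.poisson(3, size),
    "cauchy": lambda rng, size: rng.standard_cauchy(size),
}

for name, sampler in samplers.items():
    for n in (1, 5, 30, 200):
        print(name, n, normality_pvalue(sampler, n))
```

The smallest non-rejected sample size depends on the seed, the number of replications, and the test used, so treat any single answer with caution. The standard Cauchy distribution has no finite mean, so the central limit theorem does not apply to it and its sample means should keep being rejected however large the sample.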
![](https://i.imgur.com/EZwESD1.png =250x)
- Time series analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_4_time_series.ipynb)
    - Autocorrelation
    - Stationarity and unit root tests
    - Autoregressive model (AR)
    - Moving average model (MA)
    - ARMA$(p,q)$ and ARIMA$(p,d,q)$ models
- Bayesian probability [pdf](http://pillowlab.princeton.edu/teaching/mathtools16/slides/lec13_BayesRule.pdf)
    - Marc Garcia, [Bayesian inference tutorial: a hello world example](https://datapythonista.me/blog/bayesian-inference-tutorial-a-hello-world-example.html), 2020
    - Samuel Hinton, [Bayesian Linear Regression in Python](https://cosmiccoding.com.au/tutorials/bayes_lin_reg), 2019
    - Reference material:
        - (FYR) Learning from experience: an intuitive understanding of Bayes' theorem and its applications [link](https://leemeng.tw/intuitive-understandind-of-bayes-rules-and-learn-from-experience.html)
        - (FYR) Stop guessing and relying on luck! NASA and Microsoft both use "Bayesian theory" to make decisions [link](https://buzzorange.com/techorange/2019/07/24/nasa-how-to-make-right-decision/)
        - (FYR) Chapter 12: Bayesian Inference, Statistical Machine Learning, CMU [pdf](http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf)
        - [Introduction to Bayesian Modeling with PyMC3](https://juanitorduz.github.io/intro_pymc3/)
        - [Bayes' Rule With Python](http://jim-stone.staff.shef.ac.uk/BookBayes2012/bookbayesch01WithPython.pdf)
        - [Monty Hall Problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)
        - https://www.astronomy.swin.edu.au/~cblake/StatsLecture4.pdf
        - https://astrostatistics.psu.edu/RLectures/IntroBayes-1.pdf
        - https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_bayesiandecision.pdf
        - [Concepts of Bayesian Statistics (貝氏統計學的概念.pdf)](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/貝氏統計學的概念.pdf)

### Introduction to Machine Learning

- (FYR) DeepMind: A Documentary File [youtube](https://www.youtube.com/watch?v=WXuK6gekU1Y)
- Regression analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_5_machine_learning_tutorial.ipynb)
    - Ridge regression
    - LASSO regression
    - Logistic regression
- Support vector machine (SVM)
- Decision tree and random forest
- Principal component analysis (PCA)
    - https://setosa.io/ev/principal-component-analysis/
- K-means clustering
- Reinforcement learning: Q-Learning
- Deep learning
    - https://www.kaggle.com/learn/intro-to-deep-learning
- Case study: Jacky Hsueh, [Why economic theory is needed to forecast economic trends: comparing machine learning and econometrics](http://economicsnote.com/%E7%82%BA%E4%BB%80%E9%BA%BC%E9%9C%80%E8%A6%81%E7%B6%93%E6%BF%9F%E7%90%86%E8%AB%96%E4%BE%86%E9%A0%90%E6%B8%AC%E7%B6%93%E6%BF%9F%E8%B6%A8%E5%8B%A2%E6%AF%94%E8%BC%83%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/), 2021.2.26
- [Vapnik–Chervonenkis dimension](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension)
- [Receiver operating characteristic (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

![](https://i.imgur.com/PY21A4V.png =350x)
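
Ridge regression, the first regression variant listed above, has the closed-form estimate $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below implements that formula directly with NumPy on synthetic data; it omits the intercept and any feature scaling for brevity, and the helper name `ridge` is an assumption of this example rather than part of the course notebook.

```python
import numpy as np


def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^{-1} X'y, without an intercept."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)


# Synthetic data generated from known coefficients plus noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

for lam in (0.0, 10.0, 1000.0):
    print(lam, ridge(X, y, lam))
```

With `lam = 0` the estimate coincides with ordinary least squares, and the coefficients shrink toward zero as `lam` grows, which is the bias-variance trade-off that ridge regression exposes.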
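:::success
++Lab 5++ Implementing the K-Means algorithm

Write a program that implements the K-Means algorithm.

The [template program](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab5_template.ipynb) already generates the test data, shown in the left panel below.

The basic idea of K-Means is to group points according to their **distance** from one another.

For the steps of the algorithm, see this [link](https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm_(naive_k-means)).

The clustering result of the algorithm is shown in the right panel below, where the red diamond markers indicate the arithmetic center (centroid) of each cluster.

The goal is for participants to implement the basic K-Means algorithm.

Note that the clustering result (right panel) is not guaranteed to match the correct answer (left panel), so the focus of this program is the implementation of the algorithm.

![](https://i.imgur.com/329c7JV.png =700x)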
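
For orientation, the two alternating steps of the standard algorithm (assign each point to its nearest center, then move each center to the mean of its points) can be sketched as below. This is one possible implementation under simple assumptions: initial centers are picked at random from the data, and the test data are two synthetic blobs rather than the data produced by the template.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Naive K-Means: returns (labels, centers) for an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers


# Synthetic test data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans(points, k=2)
print(centers)
```

Because the result depends on the random initialization, practical implementations usually rerun the algorithm from several starting points and keep the best clustering.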
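:::

### Statistics in Practice

- Nonparametric analysis
    - Rank correlation
        - Spearman rank correlation coefficient
        - Kendall rank correlation coefficient
    - Single population
        - Sign test
        - Wilcoxon signed-rank test
    - Two dependent populations
        - Paired sign test
        - Wilcoxon paired signed-rank test
    - Two independent populations
        - Wilcoxon rank-sum test
        - Mann-Whitney U test
    - Multiple independent populations
        - Kruskal-Wallis test
    - Multiple dependent populations
        - Friedman test
    - Tests of randomness
        - Runs test
    - Kernel density estimation (KDE) [link](https://scikit-learn.org/stable/modules/density.html)

![](https://i.imgur.com/ATECGO7.png =400x)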
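
The rank correlations listed above depend only on the ordering of the data, not on its scale. The sketch below contrasts them with the Pearson coefficient on a made-up relationship that is strictly increasing but strongly nonlinear.

```python
import numpy as np
from scipy import stats

# Made-up data: y increases with x, but exponentially rather than linearly.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 2)

pearson, _ = stats.pearsonr(x, y)
spearman, _ = stats.spearmanr(x, y)
kendall, _ = stats.kendalltau(x, y)

print("Pearson :", pearson)   # clearly below 1: the relationship is not linear
print("Spearman:", spearman)  # 1: the ranks agree perfectly
print("Kendall :", kendall)   # 1: every pair of points is concordant
```

This is why rank-based methods are preferred when only monotonicity, not linearity, can be assumed.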
- Small-sample analysis
    - Fisher's exact test [link](http://blog.pulipuli.info/2017/05/fishers-exact-test-example.html)

![](https://i.imgur.com/JyiaDyG.jpg =250x)
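
Fisher's exact test is suited to small 2x2 tables because it computes the p-value exactly from the hypergeometric distribution instead of relying on a large-sample approximation. The table below is hypothetical (eight trials, with three correct and one wrong call in each row) and is not taken from the linked article.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with fixed margins of 4.
table = [[3, 1],
         [1, 3]]

odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")
_, p_greater = fisher_exact(table, alternative="greater")

print("odds ratio          :", odds_ratio)   # (3 * 3) / (1 * 1) = 9
print("two-sided p-value   :", p_two_sided)  # 34/70
print("one-sided p (greater):", p_greater)   # 17/70
```

Even a table that looks lopsided can be far from significant at this sample size, which is the point of using an exact small-sample test.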
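- Bimodal/multimodal distributions
    - https://en.wikipedia.org/wiki/Mixture_model
- Extreme value theory (EVT)
    - https://en.wikipedia.org/wiki/Heavy-tailed_distribution
    - https://en.wikipedia.org/wiki/Extreme_value_theory

## Candidate Project Topics

- Chinese culture has a tradition of eating nourishing tonic foods in winter. If this winter is unusually cold, does that push up the prices of food-sector stocks?
- Markov Chain Monte Carlo (MCMC)
- When day trading accounts for a growing share of the day's trading value, does that signal that the market is about to turn bearish?
- Epidemiological models https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
- Is it really a power shortage, or overconsumption?
    - What share of electricity consumption goes to cryptocurrency mining?
- A time series version of $R^2$.
    - $R^2$ follows a beta distribution; after taking the $n$-th root it follows a normal distribution.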

- Brownian motion
    - Brownian bridge
    - Arcsine law: https://scipython.com/blog/the-arcsine-law/
- War in Ukraine: https://ourworldindata.org/ukraine-war

## Data Sources

### Taiwan government open data

- Government open data platform: https://data.gov.tw/
- Taipei City data platform: https://data.taipei/
- Central Weather Bureau open data: https://opendata.cwb.gov.tw/dataset/observation
- Earnings explorer (薪情體驗): https://earnings.dgbas.gov.tw/experience_sub_01.aspx
- https://www.numbeo.com/cost-of-living/
- National Development Council population projections query system: https://pop-proj.ndc.gov.tw/index.aspx
- Department of Statistics, Ministry of the Interior: https://www.moi.gov.tw/stat/index.aspx
- Ministry of the Interior real estate actual transaction price query: https://lvr.land.moi.gov.tw/homePage.action
    - Is analyzing real estate with code feasible? Housing price analysis here! by [FinLab](https://www.finlab.tw/real-estate-analasys-histograms/)
- Ministry of Culture open data service https://opendata.culture.tw
- Taiwan Power Company https://www.taipower.com.tw/tc/index.aspx
    - Government open data platform dataset list for Taiwan Power Company [link](https://sheethub.com/data.gov.tw/%E6%94%BF%E5%BA%9C%E8%B3%87%E6%96%99%E9%96%8B%E6%94%BE%E5%B9%B3%E8%87%BA%E8%B3%87%E6%96%99%E9%9B%86%E6%B8%85%E5%96%AE/i/44/%E5%8F%B0%E7%81%A3%E9%9B%BB%E5%8A%9B%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8)
- Lottery-related
    - 超讚的樂透網: https://zan01.com/
    - 樂透堂: http://www.9800.com.tw/

### International open data sources

- U.S. Census Bureau: https://www.census.gov/
- World Bank: https://data.worldbank.org/
- NASA: https://nasa.github.io/data-nasa-gov-frontpage/data_visualizations.html
- Data World: https://data.world/
- Human Development Reports: http://www.hdr.undp.org/en
- Sports Reference: https://www.baseball-reference.com/
- Databank of Bank of England: https://www.bankofengland.co.uk/statistics

### Analysis platforms

- Google Data Studio: https://datastudio.google.com/

### Competition platforms

- https://www.kaggle.com/datasets/
- https://zindi.africa/competitions

## References

### Books

#### Textbooks used in the past

- Thomas Haslwanter, [An Introduction to Statistics with Python](https://www.springer.com/us/book/9783319283159), 2016. Downloadable from within the NTU campus IP range.
![](https://i.imgur.com/ypOd5fQ.png =100x)
- José Unpingco, [Python for Probability, Statistics, and Machine Learning](https://link.springer.com/book/10.1007%2F978-3-319-30717-6), 2/e [link](https://link.springer.com/book/10.1007%2F978-3-030-18545-9). Downloadable from within the NTU campus IP range.
![](https://i.imgur.com/vdfeIre.png =100x)
- Jake VanderPlas, [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/), 2016 [online](https://github.com/jakevdp/PythonDataScienceHandbook) [github](https://github.com/jakevdp/PythonDataScienceHandbook)
![](https://i.imgur.com/Bdf0PTa.png =100x)

#### Probability theory

- Sheldon Ross, [Introduction to Probability Models](https://www.elsevier.com/books/introduction-to-probability-models/ross/978-0-12-814346-9), 12/e, 2019
![](https://i.imgur.com/57W4fGF.jpg =100x)

#### Mathematical statistics

- Robert V. Hogg, Joseph W. McKean, and Allen T. Craig, [Introduction to Mathematical Statistics](https://www.amazon.com/-/zh_TW/dp/0134686993/), 8/e, 2019
![](https://i.imgur.com/2juN4uq.png =100x)
- George Casella and Roger L. Berger, [Statistical Inference](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), 2/e, 2001
![](https://i.imgur.com/D9fxPRb.png =100x)

#### Design of experiments

- Douglas C. Montgomery, [Design and Analysis of Experiments](https://www.wiley.com/en-gb/Design+and+Analysis+of+Experiments,+9th+Edition-p-9781119320937), 9/e, 2017
![](https://i.imgur.com/KH0xEM8.jpg =100x)
- Angela Dean, Daniel Voss, and Danel Draguljić, [Design and Analysis of Experiments](https://link.springer.com/book/10.1007/978-3-319-52250-0), 2017
![](https://i.imgur.com/5JKXVjP.jpg =100x)

#### General statistics

- Barbara Blatchley, [Statistics in Context](https://www.amazon.com/Statistics-Context-Barbara-Blatchley/dp/0190278951), 2018
![](https://i.imgur.com/MFDwoYZ.png =100x)
- 張翔 and 廖崇智, ++提綱挈領學統計++ (Learning Statistics Through Key Points), 8th ed., 2019/6/14
![](https://i.imgur.com/DyXbeFs.png =130x)
- 許誠哲, ++統計學:重點觀念與題解++ (Statistics: Key Concepts and Worked Problems), 2018/3/1
![](https://i.imgur.com/BdkBLQ0.png =90x)![](https://i.imgur.com/2w0llwW.png =90x)
- David M. Lane et al., ++Online Statistics Education++: http://onlinestatbook.com/Online_Statistics_Education.pdf

#### Time series

- 陳旭昇, [時間序列分析-總體經濟與財務金融之應用](http://homepage.ntu.edu.tw/~sschen/Book/Book2.htm) (Time Series Analysis: Applications in Macroeconomics and Finance), 2nd ed.
![](https://i.imgur.com/b2Ubehb.png =100x)
- ++Introduction to Econometrics with R++: https://www.econometrics-with-r.org/index.html

#### Machine learning

- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [An Introduction to Statistical Learning with Applications in R](https://link.springer.com/book/10.1007/978-1-4614-7138-7), 2013
![](https://i.imgur.com/sCY7WYs.png =100x)
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://link.springer.com/book/10.1007/978-0-387-84858-7), 2009
![](https://i.imgur.com/u1Mi1mt.png =100x)
- Ovidiu Calin, [Deep Learning Architectures](https://link.springer.com/book/10.1007/978-3-030-36721-3), 2020
![](https://hackmd.io/_uploads/rkX2aVN6q.png =100x)

### Popular science reading

- History of Statistics: https://www.york.ac.uk/depts/maths/histstat/
![](https://i.imgur.com/UzJP2o3.png =100x)
- 安德魯·維克斯, ++34個讓你豁然開朗的統計學小故事++ (34 Short Stories That Make Statistics Click), 2019/03/28
![](https://i.imgur.com/QYCD6Fo.png =140x)
- 羅伯特·艾貝爾森, ++一位耶魯大學教授的統計箴言++ (A Yale Professor's Maxims on Statistics), 2019/05/28
![](https://i.imgur.com/okSuBrz.png =100x)
- 塚本邦尊 et al., ++東京大學資料科學家養成全書:使用Python動手學習資料分++
![](https://i.imgur.com/FebESuA.png =150x)

### Courses in Taiwan and abroad

- Prof. Shiu-Sheng Chen (http://homepage.ntu.edu.tw/~sschen/)
- Dr. Shao-Wei Cheng (http://www.stat.nthu.edu.tw/~swcheng/)
    - ++Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2820/index.php
    - ++Probability++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2810/index.php
    - ++Experimental Design and Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5510/index.php
    - ++Mathematical Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat3875/index.php
    - ++Discrete Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/index.html
- ++STA-663-2017++: http://people.duke.edu/~ccc14/sta-663-2017/
- Dr. Peter Kempthorne, ++Lecture notes on Probability and Statistics++: http://users.encs.concordia.ca/~doedel/courses/comp-233/slides.pdf
- ++Mathematical Statistics++: https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/
- Emmanuel Candès, https://statweb.stanford.edu/~candes/
    - ++Theory of Statistics++: https://statweb.stanford.edu/~candes/teaching/stats300c/
    - ++Modern Markov Chain++: https://statweb.stanford.edu/~candes/teaching/stats318/
- Oleg Melnikov, ++Introduction to Statistical Inference++: http://stats200.stanford.edu/
- Liam Paninski, ++Computational Statistics++: http://www.stat.columbia.edu/~liam/teaching/compstat-spr19/
- David Aldous, ++Probability and the Real World++: https://www.stat.berkeley.edu/~aldous/157
- Bocheng Jing, ++Sta 102 - Intro Biostatistics++: https://www2.stat.duke.edu/courses/Spring13/sta102.001/

### Miscellaneous

- Secretary problem
    - https://zh.wikipedia.org/wiki/%E7%A7%98%E6%9B%B8%E5%95%8F%E9%A1%8C
    - https://style.udn.com/style/story/8073/1452739
- http://www.statslife.org.uk/images/pdf/timeline-of-statistics.pdf
- 王超辰: Medical Statistics (醫學統計學) https://bookdown.org/ccwang/medical_statistics6/
- Ioane Muni Toke, ++An Introduction to Hawkes Processes with Applications to Finance++, 2011: http://lamp.ecp.fr/MAS/fiQuant/ioane_files/HawkesCourseSlides.pdf
- https://ourworldindata.org/grapher/annual-working-hours-vs-gdp-per-capita-pwt
- https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf

