Peter's stats stuff
I write about applications of data and analytical techniques like statistical modelling and simulation to real-world situations. I show how to access and use data, and provide examples of analytical products and the code that produced them.
http://ellisp.github.io

Moving largish data from R to H2O - spam detection with Enron emails
I finally solve my problem of writing large sparse matrices from R into SVMLight format for importing to H2O, and demonstrate its application with spam detection trained on the Enron email data, comparing a generalized linear model, random forest, gradient boosting machine, and deep neural network.
Sat, 18 Feb 2017 00:00:00 +1300
http://ellisp.github.io/blog/2017/02/18/svmlite
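
As a rough illustration of the SVMLight format mentioned in this post, here is a minimal Python sketch (the post itself works in R, and this is not its code). It assumes the conventional SVMLight layout of one row per line, `label index:value ...`, with 1-based feature indices in increasing order and zeros omitted. The function name `to_svmlight` and the toy data are hypothetical.

```python
def to_svmlight(rows, labels):
    """Convert sparse rows to SVMLight-style text lines.

    rows   : list of dicts mapping 0-based column index -> value (hypothetical input layout)
    labels : list of response values, one per row
    """
    lines = []
    for label, row in zip(labels, rows):
        # Assumed convention: 1-based indices, ascending order, zero entries skipped.
        feats = " ".join(
            f"{j + 1}:{v:g}" for j, v in sorted(row.items()) if v != 0
        )
        # A row with no non-zero features is written as the label alone.
        lines.append(f"{label} {feats}".rstrip())
    return lines


# Toy data: three rows, four possible columns.
rows = [{0: 2, 3: 1}, {}, {2: 0.5}]
labels = [1, 0, 1]
svm_lines = to_svmlight(rows, labels)
print("\n".join(svm_lines))
```

Only the non-zero cells are written, which is why the format suits large sparse matrices such as document-term matrices.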

US Presidential inauguration speeches
I do some basic textual analysis and visualization with US Presidential inauguration speeches.
Mon, 23 Jan 2017 00:00:00 +1300
http://ellisp.github.io/blog/2017/01/23/inaugural-speeches

Does seasonally adjusting first help forecasting?
I test some forecasting models on nearly 3,000 seasonal time series to see if it's better to seasonally adjust first or to incorporate the seasonality into the model used for forecasting. Turns out it is marginally better to seasonally adjust beforehand when using an ARIMA model, and it doesn't matter with exponential smoothing state space models. Automated use of Box-Cox transformations also makes forecasts with these test series slightly worse. The average effects were very small, and dwarfed by differences in performance across domains and data frequencies.
Sun, 22 Jan 2017 00:00:00 +1300
http://ellisp.github.io/blog/2017/01/22/forecast-seasadj-lambda

Books I like
My ten recommended books for applied statistics and data science. Then 13 more!
Sat, 14 Jan 2017 00:00:00 +1300
http://ellisp.github.io/blog/2017/01/14/books

Cross-validation of topic modelling
Cross-validation of the "perplexity" from a topic model, to help determine a good number of topics.
Thu, 05 Jan 2017 00:00:00 +1300
http://ellisp.github.io/blog/2017/01/05/topic-model-cv

Sparse matrices, k-means clustering, topic modelling with posts on the 2004 US Presidential election
I explore different sparse matrix formats in R and moving data from R to H2O. Along the way I use k-means clustering and topic modelling to explore textual data from the Daily Kos blog on the 2004 US Presidential election.
Sat, 31 Dec 2016 00:00:00 +1300
http://ellisp.github.io/blog/2016/12/31/sparse-bags
http://ellisp.github.io/blog/2016/12/31/sparse-bagsExtracting data on shadow economy from PDF tablesThe shadow economy as a percentage of GDP in wealthier countries is in decline; and had a spike in 2009 with the economic crisis. More research is needed to adequately understand it. Along the way I experiment with extracting data frames from PDF tables; and show it's always worthwhile looking at the same data in different ways, which can be as simple as freeing up the vertical axes of graphics.Mon, 26 Dec 2016 00:00:00 +1300
http://ellisp.github.io/blog/2016/12/26/shadow-economy
http://ellisp.github.io/blog/2016/12/26/shadow-economyforecastHybrid 0.3.0 on CRANforecastHybrid 0.3.0 for ensemble time series forecasting is now on CRAN. Two new features are prediction intervals for the nnetar (neural network) component of the combination; and theta method models.Sat, 24 Dec 2016 00:00:00 +1300
http://ellisp.github.io/blog/2016/12/24/forecastHybrid-0.3

Air quality in Indian cities
Air pollution in Indian cities is unambiguously seasonal, and also might have a Diwali impact.
Sun, 18 Dec 2016 00:00:00 +1300
http://ellisp.github.io/blog/2016/12/18/air-quality-india

Extrapolation is tough for trees!
Tree-based predictive analytics methods like random forests and extreme gradient boosting may perform poorly with data that is out of the range of the original training data.
Sat, 10 Dec 2016 00:00:00 +1300
http://ellisp.github.io/blog/2016/12/10/extrapolation
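
To see why trees struggle outside the training range, here is a minimal Python sketch. It is a deliberate simplification and not the post's code: a single one-split regression tree (a "stump") rather than a random forest or boosted ensemble, fitted to made-up data where y = 2x. The helper `fit_stump` is hypothetical. The point it illustrates is that a tree predicts the mean of the training targets in a leaf, so its prediction stays flat once x goes beyond the training data.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising the sum of squared errors."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        # Candidate split halfway between two neighbouring x values.
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    # Each leaf predicts the mean of the training targets that fell into it.
    return lambda x: left_mean if x <= split else right_mean


# Made-up training data: a perfectly linear trend, y = 2x, for x = 0..9.
xs = list(range(10))
ys = [2 * x for x in xs]
predict = fit_stump(xs, ys)

# Inside the training range versus far outside it.
print(predict(9), predict(100))
```

The prediction at x = 100 is the same as at x = 9, and it can never exceed the largest training target, even though the underlying trend keeps rising. Ensembles of trees average or sum many such leaf values, so they inherit the same limit.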