[Data Science] Data Science Resources

Some useful blogs: R-bloggers, FastML, Kaggle
1. Dealing with missing values
Three types of situations: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR, possibly missing in a systematic way; visualizing the NAs can help reveal this).
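A quick first check is to count the missing values per column and then see whether the missing rate of one column depends on another column. Below is a minimal sketch in plain Python; the toy records and the helper names missing_rate and missing_rate_by are made up for illustration and are not from any library.

```python
from collections import defaultdict

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000, "group": "a"},
    {"age": None, "income": 4200, "group": "a"},
    {"age": 41, "income": None, "group": "b"},
    {"age": 33, "income": None, "group": "b"},
    {"age": 52, "income": None, "group": "b"},
    {"age": 29, "income": 3900, "group": "a"},
]

def missing_rate(rows, column):
    """Fraction of rows in which `column` is None."""
    return sum(r[column] is None for r in rows) / len(rows)

def missing_rate_by(rows, column, by):
    """Missing rate of `column` within each level of the column `by`."""
    counts = defaultdict(lambda: [0, 0])  # level -> [missing, total]
    for r in rows:
        counts[r[by]][0] += r[column] is None
        counts[r[by]][1] += 1
    return {level: m / n for level, (m, n) in counts.items()}

for col in ("age", "income"):
    print(col, round(missing_rate(rows, col), 2))

# A missing rate that differs sharply between groups argues against MCAR.
print(missing_rate_by(rows, "income", "group"))
```

In this toy data, income is missing only in group "b", so the values are clearly not missing completely at random. Note that a check like this cannot confirm the "missing not at random" case, because that depends on the unobserved values themselves.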
2. Feature selection and engineering
Two types of methods: feature ranking and subset selection.
Feature ranking methods include correlation filters and entropy-based filters.
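To make the two ranking filters concrete, here is a small sketch in plain Python: numeric features are ranked by the absolute Pearson correlation with the target, and categorical features by information gain (the reduction in label entropy). The toy data and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Label entropy minus the entropy remaining after splitting on `feature`."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: a binary target and a few candidate features.
y = [0, 0, 0, 1, 1, 1]
numeric = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # increases with the target
    "x2": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],  # unrelated to the target
}
categorical = {
    "color": ["r", "r", "r", "g", "g", "g"],  # separates the classes
    "shape": ["s", "c", "s", "c", "s", "c"],  # nearly uninformative
}

# Correlation filter: rank by absolute correlation with the target.
corr_rank = sorted(numeric, key=lambda k: abs(pearson(numeric[k], y)), reverse=True)
# Entropy-based filter: rank by information gain.
gain_rank = sorted(categorical, key=lambda k: information_gain(categorical[k], y), reverse=True)

print(corr_rank, gain_rank)
```

Both filters score each feature on its own, which is why they are cheap, and also why they can miss features that are only useful in combination.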
Subset selection includes:
 filter methods
 wrapper methods
 embedded methods; the best known is LASSO, which shrinks the coefficients of some variables to exactly zero
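To see how LASSO ends up with exact zeros, here is a rough sketch of coordinate descent with soft thresholding in plain Python. It assumes the objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, columns that are already centered, and no intercept; the data and the function names are made up for illustration. In practice you would use an existing implementation rather than this loop.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for xi, yi in zip(X, y):
                pred_without_j = sum(w[k] * xi[k] for k in range(p) if k != j)
                rho += xi[j] * (yi - pred_without_j)
                z += xi[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
    return w

# Hypothetical centered toy data: y = 3 * x1 + 0.1 * x2.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.9, -3.1, 0.0, 2.9, 6.1]

w = lasso_coordinate_descent(X, y, alpha=0.5)
print(w)  # the weak second feature is dropped (coefficient exactly 0.0)
```

The penalty also shrinks the coefficient of the useful feature (here from 3 to about 2.75), which is the price paid for the sparsity.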
Useful papers
The most useful one: An Introduction to Variable and Feature Selection
Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution
Toward Integrating Feature Selection Algorithms for Classification and Clustering
A useful R package: FSelector
3. Model selection
Many winners of Kaggle competitions have chosen XGBoost as their model. They also use ensemble methods to combine weak learners. Simple models, for instance logistic regression, make good baselines to compare against.
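As an illustration of the baseline idea only, here is a minimal logistic regression trained by batch gradient descent in plain Python, compared with the even simpler "always predict the majority class" baseline. This is not XGBoost and not an ensemble; the toy data (roughly centered, linearly separable), the learning rate, and the function names are all assumptions for the sketch.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) for xi in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data, roughly centered around zero.
X = [[-2.0, -1.5], [-1.5, -1.0], [-1.0, -2.0], [1.0, 1.5],
     [1.5, 1.0], [2.0, 2.5], [-1.5, -2.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

w, b = fit_logistic(X, y)
train_acc = accuracy(y, predict(X, w, b))
majority_acc = max(sum(y), len(y) - sum(y)) / len(y)
print(train_acc, majority_acc)
```

A more complex model is only worth its extra tuning effort if it clearly beats numbers like these on held-out data; the sketch above measures training accuracy only to keep it short.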
More in-depth papers:
• Greedy Function Approximation: A Gradient Boosting Machine. J.H. Friedman
• Stochastic Gradient Boosting. J.H. Friedman
• Additive Logistic Regression: A Statistical View of Boosting. J.H. Friedman, T. Hastie, R. Tibshirani
• Learning Nonlinear Functions Using Regularized Greedy Forest. R. Johnson and T. Zhang
Entry Level Course
Coursera: Machine Learning, taught by Andrew Ng https://goo.gl/efgkaG
Stanford Online MOOC: Statistical Learning https://goo.gl/A57b7y
Two NTU machine learning courses: Machine Learning Foundations and Machine Learning Techniques.
Medium Level to High Level
1. Stanford CS229 Andrew Ng https://goo.gl/IUUv7n
2. CMU 10701 Alex Smola https://goo.gl/aom5Za
Data Science and Engineering with Spark: Learn how to use Spark, one of the leading cluster computing frameworks, to analyze big data while leveraging Spark’s APIs.
However, the courses above cover mostly theory. There is a lot of dirty work in data cleaning and in the rest of the data science pipeline. Running models is just one of its steps.
One small tip: study in a problem-driven way. Don’t spend all your time on theory; once you have learned the material above, get some practice. There is a lot of work in practice, such as converting categorical data into numbers and transforming timestamp data. Let’s do it.
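Here is a small sketch of those two chores in plain Python: one-hot encoding a categorical column and turning a timestamp string into numeric features. The column values, the timestamp format, and the function names are assumptions for illustration.

```python
from datetime import datetime

def one_hot(values):
    """Encode each value as a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    encoded = [[int(v == c) for c in categories] for v in values]
    return encoded, categories

def timestamp_features(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into numeric features a model can use."""
    dt = datetime.strptime(ts, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),
    }

# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
encoded, categories = one_hot(colors)
print(categories)  # column order of the encoded vectors
print(encoded)

# Hypothetical timestamp in the assumed format.
features = timestamp_features("2016-03-05 14:30:00")
print(features)
```

Which timestamp features are worth keeping depends on the problem: hour of day matters for traffic data, for example, while month may matter more for sales data.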
Machine learning cheat sheet: I like this graph because it is clear and intuitive.
Editor of this post: Jun Fu