What is machine learning?
Simply put, machine learning is a class of algorithms that automatically derive rules from data and then use those rules to make predictions about unseen data.
The following picture illustrates what the field covers:
Editor's note: the chart is large, so click through if you want to see it clearly. It does not matter if you cannot read it; it only sketches the basic landscape and is not the focus of this article. Lei Feng Network's follow-up science articles will explain it in detail.
We focus on the supervised-learning part of the figure. In supervised learning, the data include the attribute we want to predict, and the problems fall into the following two categories:
Classification: samples belong to two or more categories, and we learn from labeled data to predict the categories of unlabeled data. Handwritten digit recognition, for example, is a classification problem: the goal is to map each input vector to one of a finite number of categories. Put another way, classification is the discrete (as opposed to continuous) form of supervised learning: given n samples and a limited number of categories, the task is to label each sample and assign it to the correct category.
Regression: if the desired output consists of one or more continuous variables, the task is called regression, for example predicting the length of a salmon as a function of its age and weight.
scikit-learn
scikit-learn is a machine-learning package built on NumPy, SciPy, and Matplotlib. It mainly covers classification, regression, and clustering algorithms, such as kNN, SVM, logistic regression, Naive Bayes, random forests, and k-means. In short: it is a powerful wheel, so you do not have to reinvent one.
There is a classic example: predicting the species of Anderson's irises.
We have 150 iris observations with four measurements each: sepal length, sepal width, petal length, and petal width, together with the species label: Iris setosa, Iris versicolor, and Iris virginica. We use these data to learn a model and then predict on new data. In scikit-learn, we do this by creating an estimator that learns from the existing data through a call to its fit(X, y) method.
The code is as follows:
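A minimal sketch of such code; the choice of KNeighborsClassifier here is my assumption, and any scikit-learn estimator with fit and predict would serve:

    # Minimal sketch: fit an estimator on the 150 labeled iris samples,
    # then predict the species of one new flower. KNeighborsClassifier
    # is an assumed choice, not necessarily the original one.
    from sklearn import datasets
    from sklearn.neighbors import KNeighborsClassifier

    iris = datasets.load_iris()
    knn = KNeighborsClassifier()
    knn.fit(iris.data, iris.target)        # learn from the existing data
    knn.predict([[5.0, 3.6, 1.3, 0.25]])   # -> array([0])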
It does not matter if you cannot read the code; the point is the output: array([0]).
That is, the model concludes that an iris whose sepal length, sepal width, petal length, and petal width measure 5.0, 3.6, 1.3, and 0.25 belongs to the species Iris setosa.
My goal is to personally walk through the process of building and validating a machine-learning forecast of the financial markets, and, along the way, to see whether this thing is the legendary miracle described in the literature and research reports, or simply useless.
So how is machine learning actually used in quantitative finance, for example in the stock market?
First of all, we have to get familiar with our data. Fetch ten years of raw CSI 300 index data (the code was developed in an IPython Notebook):

    df = rd.get_price('CSI300.INDX', '2005-01-01', '2015-07-25').reset_index()[['OpeningPx', 'ClosingPx']]
With the opening and closing prices as our raw data, we can draw the following three pictures.
Figure 1:
(counts of up days versus down days over the past ~2,500 trading days)
Figure 2:
(daily returns as a time series)
Figure 3:
(frequency distribution of the number of consecutive days of change)
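A rough sketch of how such plots could be drawn from df. This is my reading of the captions; the definition of Figure 3 in particular, as a histogram of up/down streak lengths, is an assumption:

    import matplotlib.pyplot as plt

    rets = df['ClosingPx'].pct_change().dropna()   # daily returns
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Figure 1: how many up days vs. down days
    up = rets > 0
    up.value_counts().rename({True: 'up', False: 'down'}).plot(
        kind='bar', ax=axes[0])

    # Figure 2: daily returns over time
    rets.plot(ax=axes[1])

    # Figure 3 (assumed reading): histogram of the lengths of
    # consecutive up/down streaks
    run_id = up.ne(up.shift()).cumsum()
    up.groupby(run_id).size().hist(ax=axes[2])
    plt.show()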
Interested readers can take a closer look; there are some very interesting patterns there. Having gotten familiar with the data, I made attempts along three lines:
1. Estimator selection: which machine-learning method we use for the forecast.
2. Training-sample size: how many of the samples immediately preceding each forecast we use as the training set.
3. Time-window selection: the number of features in each sample, that is, how many consecutive days of up/down movement each training unit contains.
In detail:
1. Based on the data at hand and the estimator-selection flowchart from scikit-learn (machine learning in Python), shown in the figure below:
we choose RandomForestClassifier, LinearSVC, and KNeighborsClassifier, with results as follows:
It can be seen that KNeighborsClassifier performs noticeably worse than RandomForestClassifier and LinearSVC: both its volatility and its win rate fall short of the other two. This result echoes a fascinating JMLR article, "Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?", which tested 179 classification models on all 121 UCI datasets and found that Random Forests and SVMs performed best.
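A minimal sketch of the kind of walk-forward comparison described above; the feature construction and all parameter values are my assumptions, not the author's actual notebook code:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier

    def backtest_win_rate(up_down, clf, n_train=100, window=5):
        # Each feature vector is `window` consecutive up/down flags;
        # the label is the next day's flag.
        X = np.array([up_down[i:i + window]
                      for i in range(len(up_down) - window)])
        y = np.array(up_down[window:])
        hits = 0
        # Walk forward: fit on the n_train samples just before each
        # prediction, so no future information leaks into training.
        for t in range(n_train, len(X)):
            clf.fit(X[t - n_train:t], y[t - n_train:t])
            hits += int(clf.predict(X[t:t + 1])[0] == y[t])
        return hits / float(len(X) - n_train)

    # Daily up(1)/down(0) flags from the closing prices fetched above.
    up_down = (df['ClosingPx'].pct_change().dropna() > 0).astype(int).tolist()

    for clf in (RandomForestClassifier(), LinearSVC(), KNeighborsClassifier()):
        print(type(clf).__name__, backtest_win_rate(up_down, clf))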
2. The training-set size constrains prediction accuracy. Ideally, the more training samples behind each prediction the better, but as the saying goes, reality is skinny: the training-set size is limited by the total amount of data actually available, and also by computing resources and time. In the end we have to make some kind of compromise: given acceptable results, pick the smallest workable number of training samples. So we compute the win rate for training-sample counts over the range 1 to 300, with results as follows:
It can be seen that, holding other conditions fixed, the win rate improves as the number of training samples increases, eventually stabilizing and oscillating around 0.52 to 0.53. To save resources, and given the limited amount of historical data, we can choose 100 as the number of training samples.
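A sketch of that sweep, reusing the backtest_win_rate helper from the earlier sketch (again an assumption about how the experiment was run, not the author's code):

    # Sweep the training-set size from 1 to 300 and record the win rate.
    sizes = range(1, 301)
    win_rates = [backtest_win_rate(up_down, RandomForestClassifier(),
                                   n_train=n) for n in sizes]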
3. The choice of the time window, which in effect measures how much the recent trading history influences the next trading day.
Does such momentum really exist? From a trading-psychology standpoint I think there is some basis: for example, if the past ten trading days all closed up, out of caution I would lean toward going short the next day. Of course that is an extreme fantasy; in the end, we still have to let the data give the answer:
The results are maddening, a bit chaotic. I then changed the starting point of the backtest and found that the results were similar each time. What they have in common: the curve collapses sharply at the beginning and then oscillates stably around 0.5, the probability of a coin flip.
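For reference, a sketch of the window sweep behind those curves, again with assumed parameters and the helper from the first sketch:

    # Sweep the feature-window length, holding the training size at 100.
    for w in range(1, 21):
        print(w, backtest_win_rate(up_down, RandomForestClassifier(),
                                   n_train=100, window=w))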
That is, the momentum does exist, but it is very weak (combining the two earlier experiments, the expected win rate sits around 0.53) and is confined to a short time window; outside that window, the forecast degenerates into a coin-flip problem.
The above is my small test of applying machine learning to financial-market prediction, summing up the results of the three figures. The achievable probability does beat a pure coin toss (dropping below 0.5 is rare, and debugging suggests 0.53 is something of a magic number), but this is, after all, only my own small demo. You can imagine that with better algorithms, more data, and more sensible feature choices, better results would be reasonable to expect.
After this attempt, I would not call machine learning a god of financial-market forecasting, but neither can I say it is useless. I believe the Holy Grail exists, hidden in some detail waiting to be discovered.
"The author" easunlu, a data engineer, like machine learning, financial applications have some research in machine learning. Talk about machine learning more, can be entered here: Ipython Notebook Research Alpha machine learning, think about the falling rise. Machine learning enthusiasts are also welcome positive messages, discussion with the author.