STEM Research & Engineering Design 2018-19: October 2018

Wednesday, October 31, 2018

Input Features

Input Features The input features consist of three sets of variables.

The first set is the historical daily trading data of INTC including previous 5 day’s adjusted closing price, log returns, and OHLC variables. These features provide basic information about INTC stock.
The second set is the technical indicators that demonstrate various characteristics of the stock’s behavior.
The third set is about indexes: S&P500, CBOE Volatility Index, and PHLX Semiconductor Sector Index.

Category 1. Daily Trading Data

S&P 500

An American stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE/ NASDAQ

The CBOE Volatility Index: a popular measure of the stock market’s expectation of volatility

The PHLX Semiconductor Sector: an index composed of companies primarily involved in the design, distribution, manufacture, and sale of semiconductors

Other performance measures

Allocative efficiency

Total surplus (welfare) is my key measure of market performance. Welfare indicates how well the market allocates trades according to underlying private valuations.

Liquidity

Markets are liquid to the extent they maintain availability of opportunities to trade at prevailing prices.
In other words: Liquidity is defined as the ability to exchange an asset for money at a price as close as possible to the equilibrium price

Price discover- This reflects how well prices incorporate information.

Literature Survey

Quantitative investment (QI) products (models/tools/systems) can provide accurate stock market prediction and help investors significantly alleviate risks of mispricing and irrational trading because of psychological factors, such as overconfidence, mental accounting, loss aversion, and so on

Article: A causal feature selection algorithm for stock prediction modeling

Zhang, Xiangzhou, et al. “A Causal Feature Selection Algorithm for Stock Prediction Modeling.”

NeuroImage, Academic Press, 9 May 2014,

www.sciencedirect.com/science/article/pii/S0925231214005359.

The main issue of quantitative investment (QI) is what features to include. This article uses algorithms in data-based analysis to enhance the QI. It also states that CFS (casual feature selection) algorithms are the best predictors. They are able to identify the variable and generate a feature subset based on the results. Other algorithms in existence are principal component analysis (PCA), decision trees (DT:CART), and the least absolute shrinkage and selection operator (LASSO). CFS is the most accurate and precise and has the ability to develop the QI product. Other common stock algorithms can only reveal base level detail and not stock features (inputs) and stock return (output). CFS’s have two unique aspects, one being to identify direct influences between variables and to verify the algorithm using extensive experiments (accuracy, precision, sharpe ratio, sortino ratio, information ratio, and maximum drawdown (MDD). PCA can reduce a large set of variable to a smaller set of uncorrelated factors while keeping variance. The results of PCA is a subset of original components but not of variables, making the resulting data hard to comprehend. DT is one root (node) and a number of branches. With only features that contribute to the classification. DT uses the entropy theory that selects specific variables with the most important information. LASSO which selects individual variables.

Additionally, there are two types of stock prediction categories, time series forecasting and trend prediction. A time series forecasting model is trained to fit the historical return/ price series of individual stock, utilised to predict the future return/ price. A trend prediction model is used to obtain the relationship between various fundamental and technical variables and the movement of stock price. Dataset, prediction models and selection algorithms include SRA, PCA, and genetic algorithm (GA), informational gain, etc. Inputs vary but often both technical and fundamental variables exist such as economic variables.

Other data mining algorithms exist such as: logistic regression (LR), neural network (NN), support vector machine (SVM), decision tree (DT). Supplementary algorithms include Bayesian network (BN), Zuo nad Kita.

Feature selection can be grouped into two categories, filter and wrapper approaches. The filter uses general characteristics of the training data to select key input features. Wrapper uses the prediction performance of a special learning algorithm to evaluate and determine the best feature subset. Evolutionary algorithm and GA are used in the latter, allowing it to perform better, however it is more expensive.

Article: Using Bitcoin Pricing Data to Create a Profitable Algorithmic Trading Strategy

Link: http://cs229.stanford.edu/proj2017/final-reports/5244395.pdf

Uses data from historical GDAX prices which was able to output the current open, close, high, low price and volume every minute time interval over the past year approximately 450,000 data points. This projects goal was to buy and sell in x minutes through predicting the ratio of the price x minutes later to current prices rather than predicting a standard up/down value.

The features they used were High/Current price, Low price/Current price, Average price/Current price, volume of trading in BTC, Proportion of increase in price every minute, proportion of convex change every minute, ratio of price n minutes ago to current price, average price/current price, volume n minutes ago/current volume, etc.

(PCA and Feature Selection) WAP, VR, and R are not viable options. WAP is less varied than AP while being highly correlated. VR has low variance and an inaccurate predictor. R was similar to A however, A captured more information. Meaning, Baseline, LinReg, LonReg PCA and Neural Networks remained.

The methods that were used were Baselines, Weighted Logistic Regression, Principal Component Analysis and Neural Networks. Weighted average, gains and AUC (area under curve which measured the true positive rate vs the false positive weight) were used as measuring increments. *Refer back to article for indepth*

In conclusion, all of the listed, approved models had gains that outperformed the average increase per minute of bitcoin. Meaning that actually using them in market like situations was more effective than buying and holding. PCA had small increased gains but with conjunction to neural networks, significant gains were represented. Their algorithm was questioned to whether or not it was actually 100% reliable due to recent spikes. *This was not tested in real time*

Article: Price Prediction Evolution: from Economic Model to Machine Learning

Link: http://cs229.stanford.edu/proj2017/final-reports/5241834.pdf

Used the stock index to gain data and comprehensive market tendency. They extracted four features: Max, Min, Mean, and Standard Deviation every 5 days. They inputted the data by a Shifting Window Pattern. They first ran multilinear regressions adding features one by one. The input features they used were univar, bivar, CPI and GDP. Using GDP as the main predictor of the closing prices. They regretted using increments of 5 days as it may not have reflected the real behavior of the market. Concluding that the more macroeconomic features they used in their algorithm diluted their results. Recommending: Using only closing prices and certain macroeconomic features as a good approach.

LWR is used to predict certain features in the market. Defeisiancis of LWR is that it has time lag which then wont reflect the most up to time data. However, it is one of the most accurate predictions. NN used the max, min, mean and standard deviation to feed into the neural network. WIth 2 hidden layers activated by ReLU and no output activation layers. Using cross validation to tune the parameters. They then split the data into the train set, dev set and test set. Importantly using the dev set to adjust the models topology (number of layers, neurons per hidden layer, size of mini batch). Outputting 500 data with 481 MSE plotted into the data. SVR (support vector regression) overcomes difficulties in dimensions. Developed by SV and used it to predict stock behavior (forecasting the curve tendency). SVR is more flexible as it uses relaxation variables. In SVR they used libsvm in MATLAB and divided data into 3 groups 4000(train), 500(dev), 500(test).

Article: Application of Deep Learning to Algorithmic Trading

Link:http://cs229.stanford.edu/proj2017/final-reports/5241098.pdf

Project was based around LSTM (Long Short Term Memory) Networks. Their algorithm predicted stock data with the goal to forecast the next day’s stock price of Intel Corporation (NASDAQ: INTC) by adjusting the closing price based on information/ features available the next present day. Trading Intel stock accordingly to the strategy developed. Mostly using Locally Weighted Regression.

The Trading Framework used had 4 steps. (1) Input: used daily trading data, technical indicators and indexes. (2) Model: LSTM Network, LWR model. (3) Output: Predicting next day’s price. (4) Decision: Trading decision of whether to buy or sell.

Variables used: (3) historical daily trading data was composed of INTC’s last 5 days adjusted closing price, log returns, open/close price, high/low price, trading volume. The second set was composed of technical indicators that demonstrated various characteristics of the stock behavior. The final set included the S and P 500, CBOE Volatility index and PHLX Semiconductor Sector.

The technical indicators used were the Rolling Average/Standard Deviation with 5 and 10 days window, Bollinger Band: two standard deviations from a moving average, Average True Range: a measure to volatility of price, 1 month Momentum: the difference between current price and the price 1 month ago, Commodity Channel Index: an identification of cyclical trends, Rate of Change: the momentum divided by the price 3 months ago, Moving Average Convergence Divergence: a display trend following characteristics and momentum characteristics, Williams Percent Range: a measure of the buying and selling pressure

*Reference for model, data, outcome, price prediction and return plots*

Strategy: used trained models to compute the predicted price. If the price the next day was higher than the current price, one share of INTC was bought. If the predicted price of INTC is lower than the current price, one share was sold.

Conclusion: LSTM Network and LWR models can predict the general trend of the INTC stock price. LSTM outperforms LWR in terms of profitability and accuracy over 3 periods. LSTM is more robust than LWR. LSTM has a smaller MSE than LWR for Dev set and Test set. LSTM yields higher returns and Sharpe ratio than LWR based strategy and simple buy and hold strategies. *Exclude dramatic price changes*. Using hyper parameters and adding regularized term would improve the performance and reinforced learning could generate more stable and higher returns.

*Reference report for LSTM equations and explanation

Article: Predicting stock prices for large-cap technology companies

Link:http://cs229.stanford.edu/proj2017/final-reports/5237355.pdf

Used data from previous days and financial news articles to predict the changes in the future of a given stock. Prediction was excluded 10 days before and after the company’s Earnings report dates to reduce dramatic changes. Overall, this algorithm was 59% accurate and annualized a return of about 15%.

Used NASDAQ(.com) to gain data from 5 years ago. News data was pulled from XIGNITE(.com) and are processed to remove duplications and have exact dates assigned to each headline.

To digest the news articles, Naive Bayes theory was used which modeled 400 days and tested on 200 days of data. The output is the probability of a positive relative price change. Fixed tokens were used as replacements to numbers, percentages, and money amounts.

Features of the NN model: Relative price changes for the same stock from the past 20 trading days, News predictors from the last 20 trading days, news predictor of the current day. The results show that percentage gain is the most important factor. This project has 56.35% accurate predictions with news and without news resulted in 52.38%. NN w/ news produces the most accurate results with 59% accuracy and a daily gain of 0.0423%

In reflecting the author realized that stock prediction was too complex to be captured by price changes and news. XIGNITE(.com)’s news could have had higher quality news. Reducing and removing stemmed words into base forms, insignificant words, etc would improve prediction accuracy. And including trends in industries, political influences, competitor trends, etc to capture all the complexities of a stock market.

Article: Using AI to Make Predictions on Stock Market

Link:http://cs229.stanford.edu/proj2017/final-reports/5212256.pdf

Goal is to create an automated tool for managing investments with limited amounts of stocks. First, they designed an algorithm to predict increases and decreases in the next n days, using the stock prices and volumes in the past m days. Alpha Vantage API was used to access the daily open price, high price, low price, close price and daily volume since 2000.

Predictions based on Technical Indicators: Moving Average Convergence Divergence (MACD), Stochastic oscillator (STOCH), Relative strength index values (RSI), Average directional movement index (ADX) values, Absolute price oscillator values with SMA (Apo-SMA/ APO-EMA), Commodity channel index values (CCI), Aroon, Bollinger bands (BBANDS) values, Chaikin A/D line (AD) values, On balance volume (OBV) values.

Results: The variable targeted was the difference between the prices over a n-day period using the predictors: the stock price at the end of the day, beginning of period, volume of the same day, and the 11 indicators from the previous section. Using the training set, dev set, and test set in a ratio of 7:2:1. And evaluating the data with classification (of the price trend (increase/decrease)) and the mean squared error. Data shows that the support vector regression gives the best result however, it is expensive. So, for larger data being processed, linear regression is used.

Article: Optimised Prediction of Stock Prices with Newspaper Articles

Link:http://cs229.stanford.edu/proj2016spr/report/010.pdf

Explored the predictive power of using newspaper articles on the stock prices of companies. The testing accuracy was up to 61% using supervised learning. Additionally, the prediction algorithm ran through MDP (Markov decision process) which buys and sells shares on a simulation that is programmed. Binary responses along with machine learning and newspaper articles were used to predict up or down changes in stock prices. Past algorithms include Naive Bayes algorithm, SVM, the Perceptron, Boosting and bag of words. In addition, they expanded the bag of words to include shorter and more words. Term frequency-Inverse Document Frequency (TF-IDF) was used to improve the algorithms ability to grasp words and cross validate. Afterwords, a Markov Decision Process (MDP) learner was created that builds the base predictor and reinforces its ability to buy and sell shares.

Previous work includes (1) Gidofalvi: implemented naive bayes text classifier on financial news to track short term fluctuation prices, concluding that there is a strong correlation between news articles and stock prices in a 20 minute window before and after the publishing of the article.

Data: accessing New York Times and the Wall street Journal was difficult do to previous set up guidelines that hindered website access. Making the newspaper research reliant on Proquest Newsstand database. The input went from item term searches to federated search algorithms to PQNS XML TREE to XML tree parsing algorithm to PQNS article URL to Article Parsing algorithm to raw text to stemming algorithm to stemmed text to output.

Proquest newsstand archives all articles to a searchable database. With a written algorithm to generate federated search URL’s you can automatically obtain PQMS XML Tree which contains the URLs of the full texts of articles that mention the company of interest. Writing an XML tree parser will allow data to automatically be compiled into a list of PQNS URLs where the full texts are contained. Proquest needs a validating URL, meaning that there must also be a web scraper specifically to work around limitations of the Proquest database generating cURLs so that the article algorithmic gatherer can act like a real user’s browser. The use of regular expressions to scan full texts from webpages to obtain raw data to go to the Python stemming library for pre-processing.

Article: Using News Articles to Predict Stock Price Movements

Link:http://cseweb.ucsd.edu/~elkan/254spring01/gidofalvirep.pdf

Short term stock price movements can be predicted by financial news articles.

Tuesday, October 30, 2018

10/29/18

In class today we watched presentations and worked on our own presentations. I have created a camera app in the sense that all of the tutorials I’ve watched say that I’m doing the right thing, but still there are 26 errors (there used to be 51 so I’m making progress yay!). I think that the errors are due to things such as out of date software so they will be an easy fix. However also maybe the errors include flaws with the code, specifically with initializers and identifiers. I plan to continue work on it overnight the next week, hopefully completing it by Friday.

Monday, October 29, 2018

Icon Making: Photoshop.

I created my new icon by creating two layers; black and white. Then I placed the white layer at the top and used the eraser to draw out the logo. I drew an eye with my initials within them due to the age old saying: "The eyes are the window of the soul". What sort of tools did you use in Photoshop during the making of your icon?

Thursday, October 25, 2018

MATLAB Tutorials

MATLAB Video Tutorials Progress Chart

MATLAB Video Tutorials: Watch the MATLAB tutorials such that you can use this powerful tool to help you solve a wide spectrum of STEM problems. Please use the Progress Chart to document your personal progress in watching the tutorials.

Part I: MATLAB Overview

MATLAB Overview: Get an overview of MATLAB®, the language of technical computing. (2:05)
Analyzing and Visualizing Data with MATLAB: Explore, visualize, and model your data with MATLAB®. (3:26)
Programming and Developing Algorithms with MATLAB: Write programs and develop algorithms using the high-level language and development tools in MATLAB®. (4:32)
Developing and Deploying Applications with MATLAB: Develop and share MATLAB® applications as code, executables, or software components. (3:51)
Getting Started with MATLAB: Get started with MATLAB® and learn how to get more information. (7:00)
Working in The Development Environment: Access tools such as the command history workspace browser and variable editor, save and load your workspace data, and manage windows and desktop layout. (5:21)
Top Ways to Get Help: Find online support to help solve your toughest problems while using MATLAB® and Simulink® products. (3:20)
Importing Data from Text Files Interactively: Use the import tool to import numeric and text data from delimited and fixed width text files. Generate MATLAB^® code to repeat the process on similar files. (7:01)
Importing Data from Files Programmatically: Import data from spreadsheets, text files, and other formats into MATLAB^® using file I/O functions. (3:55)
Importing Spreadsheets into MATLAB: Select and load mixed textual and numeric data from spreadsheets interactively then generate the required MATLAB^® code. (4:34)
Using Basic Plotting Functions: Create plots programmatically using basic plotting functions. (5:52)
Working with Arrays in MATLAB: Create and manipulate MATLAB® arrays, including accessing elements using indexing. (8:17)
Introducing MATLAB Fundamental Classes (Data Types): Work with numerical, textual, and logical data types. (5:46)
Introducing Tables and Categorical Arrays: Manage mixed-type tabular data with the table data container, and data from a finite, discrete set of categories with the memory-efficient categorical array. (6:01)
Introducing Structures and Cell Arrays: Use structures and cell arrays to manage heterogeneous data of different types and sizes. (5:04)
Writing a MATLAB Program: Write a MATLAB® program, including creating a script and a function. (4:57)
Publishing MATLAB Code from the Editor: Share your work by publishing MATLAB® code from the MATLAB Editor to HTML and other formats. (5:57)
Developing Classes Overview: Design classes by defining properties, methods, and events in a class definition file. (10:48)
Calling MATLAB from C Code: Call MATLAB® from C, C++ or Fortran code using the MATLAB Engine Library. (1:30)

Part II: Computer Programming with MATLAB

Lesson 1: Introduction
1.1 Introduction (11:43)
1.2 The MATLAB Environment (23:03)
1.3 MATLAB Online (27:43)
1.4 MATLAB as a Calculator (14:25)
1.5 Syntax and Semantics (5:01)
1.6 Help (8:37)
1.7 Plotting (19:06)

Lesson 2: Matrices and Operators
2.1 Introduction to Matrices and Operators (11:25)
2.2 The Colon Operator (8:45)
2.3 Accessing Parts of a Matrix (21:33)
2.4 Combining and Transforming Matrices (10:06)
2.5 Arithmetic Part 1 (18:07)
2.6 Arithmetic Part 2 (11:52)
2.7 Operator Precedence (13:31)

Lesson 3: Functions
3.1 Introductions to Functions (5:39)
3.2 Function I/O (22:15)
3.3 Formal Definition of Functions (2:52)
3.4 Subfunctions (6:17)
3.5 Scope (5:24)
3.6 Advantages of Functions (2:39)
3.7 Scripts (4:27)

Lesson 4: Programmer’s Toolbox
4.1 Introduction to programmer’s Toolbox (7:06)
4.2 Matrix Building (15:11)
4.3 Input / Output (20:47)
4.4 Plotting (17:47)
4.5 Debugging (22:17)

Lesson 5: Selection
5.1 Selection (11:53)
5.2 If-Statements, continued (8:33)
5.3 Relational and Logical Operators (34:51)
5.4 Nested If-Statements (2:12)
5.5 Variable Number of Function Arguments (6:40)
5.6 Robustness (8:37)
5.7 Persistent Variables (6:54)

Lesson 6: Loops
6.1 For-Loops (38:50)
6.2 While-Loops (20:16)
6.3 Break Statement (29:31)
6.4 Logical Indexing (37:29)
6.5 Preallocation (8:59)

Lesson 7: Data Types
7.1 Introduction to Data Types (20:27)
7.2 Strings (29:04)
7.3. Structs (14:51)
7.4 Cells (21:47)

Lesson 8: File Input/Output
8.1 Introduction to File Input/Output (15:00)
8.2 Excel Files (9:12)
8.3 Text Files (12:17)
8.4 Binary Files (38:55)

Wednesday, October 17, 2018

STEM Seminar

We are going to have our STEM Seminar starting next Tuesday (10/23). Each individual should prepare to present a key topic of your research field in depth based on both your study and research experience. The contents should be informative and educational to the class. The presentation should be no longer than 15 minutes (including 3 minutes for Q&A). You are encouraged to use PowerPoint, videos, demonstrations and handouts to help the class understand and grasp the new materials in a short time. Presentation materials should be in professional quality: concise, accurate, logical, rich in contents, and visually pleasant. Please plan, coordinate, and rehearse your presentation in advance.

Final Compaosition Project

One thing that I have taken away from this project would be the importance of angles, as certain photos used in my project look much better when taken from one angle compared to another. Knowing the importance of angles, I will be able to take photographs that are more accurate to what I envision in the future.

Sunday, October 7, 2018

Steps of Literature Survey

Step 1: Search sources

Use internet, library and other possible medias to find literature sources. Keyword, category and author searches are effective tools. Use hyperlinks to broaden and deepen your search.

Step 2: Filter information

Browse through the collected sources, read the abstracts, and select the ones relevant to your research field. Searching and filtering are iterative processes. The process leads to better understanding of your research field.

Step 3: Create a summary

Recap the important information and state the essences of the sources relevant to your research. It might trace the research progression of the field.

Step 4: Create a synthesis

Re-organize the information in a meaningful way, sometimes across the sources. Evaluate the sources, analyze the strengths and limitations of the research, and advise the readers on the most pertinent or relevant.

Further literature survey details can be found in the presentation file.