inf-428-data-analytics-online

Assignment

Machine Learning

Compile all answers in a Word document. For questions that require Python notebooks or KNIME workflows, submit a screenshot.

The data needed for the assignment is here

dataingdata.csv - the dating data. It should be ready for machine learning in KNIME. You may need to convert the ‘does she like’ column for processing with Python.

titanic data - the original Titanic data. A useful starting point for Python. This notebook shows additional preprocessing steps that may be needed.

titanic_preprocessed.csv - the Titanic data with some preprocessing done to deal with missing data. Use this for the Titanic question. (Note: more preprocessing may be needed.)
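If a numeric class column is wanted on the Python side, one possible approach is to map each distinct label to an integer code. The sketch below is only an illustration: the inline sample, the feature column names (feature_a, feature_b), and the label values (yes, no) are hypothetical stand-ins, so check the header and values of the real file before adapting it.

```python
import csv
import io

# Hypothetical sample standing in for the dating file. The feature names and
# the label values are assumptions, not taken from the real CSV.
sample_csv = io.StringIO(
    "feature_a,feature_b,did_she_like\n"
    "1.0,2.0,yes\n"
    "3.0,0.5,no\n"
    "2.5,1.5,yes\n"
)

rows = list(csv.DictReader(sample_csv))

# Build a stable mapping from each distinct label to an integer code.
labels = sorted({row["did_she_like"] for row in rows})
label_to_code = {label: code for code, label in enumerate(labels)}

# Numeric feature matrix and numeric class vector.
X = [[float(row["feature_a"]), float(row["feature_b"])] for row in rows]
y = [label_to_code[row["did_she_like"]] for row in rows]

print(label_to_code)
print(y)
```

To read the real file, replace the inline sample with an opened file handle. Keeping label_to_code around makes it possible to translate predictions back to the original labels.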

Explain why we divide data into testing and training sets. (1 point)
A class takes a test and the entire class does horribly. The prof gives them a second chance. The next week: a) The prof gives them the exact same test again, with the same questions and the same answers. The students do well this time. Have they really learned anything? b) Alternatively, the prof gives them a slightly different test on the same material: different questions, but on the same concepts. The students do well this time. Have they learned anything? c) Explain how this anecdote relates to the concept of ‘self prediction’. See this video for hints. (1 point)
In KNIME, what node can be used to divide data into training and testing sets? (1 point)
Explain what overfitting is. (the video referenced in part 2 may also be useful here) (1 point)
Make a scatter plot of some of the indian-diabetes data using distinct types of points (for example, distinct colors) for each class (use KNIME or Python). Find two features (columns) that visually do a good job of separating the classes. Submit a screenshot. (2 points)
Describe the K-Nearest Neighbor algorithm in words (submit a Word doc or e-mail). (1 point)
Perform classification with the KNIME k-nearest neighbors and KNIME Naive Bayes algorithms on the dating dataset and the Titanic dataset. Remember to divide the data into training and testing sets (use the partitioner). (4 points)

For the dating dataset, “did_she_like” is the class. Report the accuracy of each algorithm. For the Titanic dataset, “survived” is the class. Report the accuracy of each algorithm.

For both, embed a screenshot of the workflow.
Perform classification with scikit-learn k-nearest neighbors on the dating and Titanic datasets. Remember to divide the data into training and testing sets. (4 points)
In class we have sometimes been testing our algorithms on extremely simple datasets: datasets so simple that a human can see what the answer is right away. Why is it useful to test an algorithm on a very simple dataset? (1 point) (See python/machineLearning/DiabetesMachineLearning.ipynb)
We have two classes of drink. We know they are either Cola or Coffee. We measure the following features:
Temperature
% carbonation

Is this a good design? When will a classifier based on these features succeed? When will it fail? (1 point)
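For the scikit-learn question above, the overall pattern (split, fit, predict, score) might look roughly like the following. This is a minimal sketch on a made-up, deliberately simple dataset of two well-separated clusters, not the dating or Titanic data. The values of test_size, random_state, and n_neighbors are arbitrary choices for illustration, not settings required by the assignment.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: class 0 sits near the origin, class 1 far away,
# so a human can tell the classes apart at a glance.
cluster_0 = [[x, y] for x in range(0, 5) for y in range(0, 5)]
cluster_1 = [[x, y] for x in range(10, 15) for y in range(10, 15)]
X = cluster_0 + cluster_1
y = [0] * len(cluster_0) + [1] * len(cluster_1)

# Hold out 30% of the rows for testing; stratify keeps both classes present
# in each part, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit k-nearest neighbors on the training rows only.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Score on rows the model has not seen during fitting.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"test accuracy: {accuracy:.2f}")
```

For the real datasets, X would hold the numeric feature columns and y the class column (“did_she_like” or “survived”).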