Background
In the modern world, good information can be considered the life force of business. Machine learning is one of the key techniques for identifying useful patterns in raw data and extracting information from it. With the rapid growth of data, the ability of simple machine learning methods to handle such volumes is challenged in terms of performance.
To overcome this drawback, large-scale machine learning systems are being introduced. The main features of large-scale machine learning systems are:
- Scalability
- Dynamic scheduling
- Parallel execution
- Fault tolerance
- Load balancing
Apache Spark and Apache Mahout are some of the projects that provide large-scale machine learning capabilities.
Aim
This project tests an alternative method of scaling: a histogram is used to partition the training data, and a logistic regression model is built for each partition. Elaborating more on the aim:
- Logistic regression will be run on the full training data set
- Logistic regression will be run on training data that has been randomly partitioned
- Logistic regression will be run on training data partitioned using a histogram
- The outputs of all of the above will be compared and evaluated against each other
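The histogram-partitioned approach above can be sketched in plain numpy. Everything here is an illustrative assumption rather than the project's actual setup: the synthetic data set, the bin count of 4, the choice of the first feature as the partitioning key, and the simple gradient-descent trainer are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (illustrative only): the first
# feature, plus noise, drives the label.
n = 3000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; last weight is the bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # sigmoid
        w -= lr * Xb.T @ (p - y) / len(y)       # mean log-loss gradient
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(float)

# Baseline: one model on the full data set.
w_full = fit_logistic(X, y)
acc_full = (predict(w_full, X) == y).mean()

# Histogram-partitioned variant: bin the data on the first feature,
# then fit one logistic regression model per histogram bin.
edges = np.histogram_bin_edges(X[:, 0], bins=4)
bins = np.digitize(X[:, 0], edges[1:-1])        # bin index 0..3 per row
models = {b: fit_logistic(X[bins == b], y[bins == b]) for b in np.unique(bins)}

# At prediction time, each example is routed to its own bin's model.
preds = np.array([predict(models[b], x[None, :])[0] for x, b in zip(X, bins)])
acc_part = (preds == y).mean()
```

Comparing `acc_full` with `acc_part` mirrors the evaluation described above; the same routing would also be applied to held-out test data in a real experiment.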
Existing work
Spark provides a powerful library (MLlib) implementing some common machine learning functionality.
Apache Mahout offers:
- Scalability to large data sets
- Scalability to support your business case
- A scalable community