Background
In the modern world, good information can be considered the life force of business. Machine learning is one of the key techniques for identifying useful patterns in raw data and extracting information from it. With the rapid growth of data, the ability of simple machine learning methods to handle such volumes is challenged in terms of performance.
To overcome this drawback, large-scale machine learning systems are being introduced. The main features of large-scale machine learning systems are:
- Scalability
- Dynamic scheduling
- Parallel execution
- Fault tolerance
- Load balancing
Apache Spark and Apache Mahout are some of the projects that provide large-scale machine learning capabilities.
Aim
This project tests an alternative method of scaling: a histogram is used to partition the training data, and a logistic regression model is built for each partition. Elaborating more on the aim:
- Logistic regression will be run on the full training data set
- Logistic regression will be run on training data that has been randomly partitioned
- Logistic regression will be run on training data partitioned using a histogram
- The outputs of all of the above will be compared and evaluated against each other
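The histogram-partitioned approach above can be sketched in plain numpy. Everything here is an illustrative assumption rather than the project's actual setup: the synthetic data set, the bin count of 4, the choice of the first feature as the partitioning key, and the simple gradient-descent trainer are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (illustrative only): the first
# feature, plus noise, drives the label.
n = 3000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; last weight is the bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # sigmoid
        w -= lr * Xb.T @ (p - y) / len(y)       # mean log-loss gradient
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(float)

# Baseline: one model on the full data set.
w_full = fit_logistic(X, y)
acc_full = (predict(w_full, X) == y).mean()

# Histogram-partitioned variant: bin the data on the first feature,
# then fit one logistic regression model per histogram bin.
edges = np.histogram_bin_edges(X[:, 0], bins=4)
bins = np.digitize(X[:, 0], edges[1:-1])        # bin index 0..3 per row
models = {b: fit_logistic(X[bins == b], y[bins == b]) for b in np.unique(bins)}

# At prediction time, each example is routed to its own bin's model.
preds = np.array([predict(models[b], x[None, :])[0] for x, b in zip(X, bins)])
acc_part = (preds == y).mean()
```

Comparing `acc_full` with `acc_part` mirrors the evaluation described above; the same routing would also be applied to held-out test data in a real experiment.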
Existing work
Spark provides a powerful library (MLlib) implementing some common machine learning functionality.
Apache Mahout offers:
- Scalability to large data sets
- Scalability to support your business case
- A scalable community