Thursday, February 12, 2015

Test scaling Logistic regression (ML classification algorithm) by partitioning data by Histogram: Introduction

Background


In the modern world, good information can be considered the life force of a business. Machine learning is one of the key techniques for identifying useful patterns in raw data and extracting other information from it. With the rapid growth of data, simple machine learning methods struggle to handle the volume with acceptable performance.
To overcome this drawback, large-scale machine learning systems are being introduced. The main features of large-scale machine learning systems are:
  • Scalability
  • Dynamic scheduling
  • Parallel execution
  • Fault tolerance
  • Load balancing
Apache Spark and Apache Mahout are some of the projects that provide large-scale machine learning capabilities.

Aim

This project tests an alternative method of scaling the system: creating a model by partitioning the training data with a histogram and applying logistic regression to each partition. In more detail:
  • Logistic regression will be run using the full data set
  • Logistic regression will be run on training data that has been partitioned randomly
  • Logistic regression will be run on training data that has been partitioned using a histogram
  • The output of all the above will be compared and evaluated against each other
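
The histogram-partitioned run can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual implementation: it assumes the histogram uses equal-width bins over a single feature, that one model is trained per bin, and that a new point is classified by the model of the bin it falls into. The function names (fit_by_histogram, predict_by_histogram, bin_index), the learning rate, and the synthetic data are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if sigmoid(score) >= 0.5 else 0


def bin_index(value, lo, hi, n_bins):
    # Equal-width histogram bin (an assumption); out-of-range values are clamped.
    if hi == lo:
        return 0
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)


def fit_by_histogram(X, y, feature=0, n_bins=4):
    """Train one logistic regression model per histogram bin of one feature."""
    lo = min(x[feature] for x in X)
    hi = max(x[feature] for x in X)
    parts = [([], []) for _ in range(n_bins)]
    for xi, yi in zip(X, y):
        px, py = parts[bin_index(xi[feature], lo, hi, n_bins)]
        px.append(xi)
        py.append(yi)
    models = [fit_logistic(px, py) if px else None for px, py in parts]
    # Empty bins borrow the model of the nearest non-empty bin.
    trained = [i for i, m in enumerate(models) if m is not None]
    models = [models[min(trained, key=lambda t: abs(t - i))] for i in range(n_bins)]
    return lo, hi, models


def predict_by_histogram(fitted, x, feature=0):
    # Route the point to the model trained on its bin.
    lo, hi, models = fitted
    return predict(models[bin_index(x[feature], lo, hi, len(models))], x)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: the label is 1 when the two features sum to more than 0.
    X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
    y = [1 if a + b > 0 else 0 for a, b in X]
    full = fit_logistic(X, y)
    parts = fit_by_histogram(X, y, feature=0, n_bins=4)
    acc_full = sum(predict(full, x) == t for x, t in zip(X, y)) / len(X)
    acc_hist = sum(predict_by_histogram(parts, x) == t for x, t in zip(X, y)) / len(X)
    print("full data accuracy:", acc_full)
    print("histogram-partitioned accuracy:", acc_hist)
```

Because each bin is trained independently, the per-bin fits could run in parallel on separate workers, which is where a scaling benefit would come from. The randomly partitioned baseline could reuse fit_logistic on random subsets in the same way.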


Existing work

Spark provides a powerful library (MLlib) that implements some common machine learning functionality.
Apache Mahout focuses on scalability:
  • Scalable to large data sets
  • Scalable to support your business case
  • Scalable community
