PySpark MLlib: Machine Learning at Big Data Scale
PySpark MLlib is Apache Spark’s machine learning library, built for big data. It provides ML algorithms (classification, regression, clustering, collaborative filtering) that scale across distributed clusters. This guide covers core concepts and practical examples.
Core ML Concepts
- Classification: Binary and multiclass, e.g. Decision Trees, Random Forest, Naive Bayes.
- Clustering: K-Means, Gaussian Mixture, Hierarchical.
- Frequent Pattern Mining: Itemset analysis, market basket.
- Linear Algebra Utilities: mllib.linalg for matrix operations.
- Recommendation Systems: Collaborative filtering with ALS.
- Regression: Linear and generalized linear models that predict continuous outcomes. (Logistic regression, despite its name, is a classification method.)
Features
- Feature extraction from raw inputs.
- Feature transformation (scaling, encoding).
- Feature selection for optimal models.
- Locality Sensitive Hashing (LSH).
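To make "feature transformation (scaling, encoding)" concrete, here is a minimal plain-Python sketch of the two ideas on a handful of values. It is an illustration of the arithmetic only, not MLlib's implementation: in PySpark the corresponding transformers live in pyspark.ml.feature (for example StandardScaler and OneHotEncoder), and library implementations differ in details such as which standard deviation they use or whether one category is dropped. The sample values and the function names standardize and one_hot are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Map each label to a 0/1 indicator vector, one slot per distinct label (sorted)."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1.0 if index[label] == j else 0.0 for j in range(len(categories))]
        for label in labels
    ]


# Hypothetical sample data, chosen only for illustration.
session_minutes = [30.0, 32.0, 34.0]
scaled = standardize(session_minutes)
encoded = one_hot(["web", "app", "web"])

print(scaled)
print(encoded)
```

Scaling keeps features with large numeric ranges from dominating distance-based or gradient-based models, and encoding turns categories into numbers without implying an order between them.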
Example 1: Linear Regression
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("LinReg").getOrCreate()
# Load the CSV and let Spark infer the column types.
df = spark.read.csv("Ecommerce-Customers.csv", header=True, inferSchema=True)
# Combine the numeric input columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", "Time on Website"],
    outputCol="features"
)
data = assembler.transform(df).select("features", "Yearly Amount Spent")
# Fit a linear model of yearly spend on the assembled features.
regressor = LinearRegression(featuresCol="features", labelCol="Yearly Amount Spent")
model = regressor.fit(data)
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
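The coefficients and intercept printed above define the fitted line: a prediction is the dot product of the coefficients with the feature vector, plus the intercept. The plain-Python sketch below shows that arithmetic. The numbers are hypothetical placeholders, not results from the Ecommerce dataset, and predict is an illustrative helper rather than part of MLlib.

```python
def predict(coefficients, intercept, features):
    """Linear model prediction: dot(coefficients, features) + intercept."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Hypothetical values standing in for model.coefficients and model.intercept.
coefficients = [2.0, 3.0, 0.5]
intercept = 10.0

# One hypothetical customer: session length, time on app, time on website.
customer = [1.0, 2.0, 4.0]
estimate = predict(coefficients, intercept, customer)
print("Estimated yearly spend:", estimate)
```

Reading the coefficients this way also helps with interpretation: each one is the change in the predicted label for a one-unit change in its feature, with the other features held fixed.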
Example 2: K-Means Clustering
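from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler
# Iris.csv is a CSV file, so read it as CSV rather than as LIBSVM.
raw = spark.read.csv("Iris.csv", header=True, inferSchema=True)
# Assumption: the file has a header row and every column inferred as double
# is a measurement. Adjust the column selection to match your copy of the data.
feature_cols = [name for name, dtype in raw.dtypes if dtype == "double"]
dataset = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(raw)
# Cluster into 3 groups with a fixed seed for reproducibility.
kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(dataset)
predictions = model.transform(dataset)
# Score cluster separation with the silhouette measure.
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette Score: {silhouette}")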
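The silhouette score reported by ClusteringEvaluator ranges from -1 to 1, and values near 1 mean points sit much closer to their own cluster than to the nearest other one. As a rough illustration of what is being measured, here is a small plain-Python sketch using squared Euclidean distance (the evaluator's default distance measure). It is a brute-force version for a few points, not how Spark computes the score at scale. The sample points and function names are hypothetical, and the sketch assumes at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points; points in singleton clusters score 0."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(sq_dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(sq_dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Two tight, well-separated hypothetical clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(f"Silhouette: {score:.4f}")
```

Comparing this score across several values of k is a common way to choose the number of clusters.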
Example 3: Collaborative Filtering (ALS)
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
ratings = spark.read.csv("MovieLens.csv", header=True, inferSchema=True)
training, test = ratings.randomSplit([0.8, 0.2])
als = ALS(
    maxIter=5, regParam=0.01,
    userCol="userId", itemCol="movieId", ratingCol="rating",
    coldStartStrategy="drop"
)
model = als.fit(training)
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))
# Top 10 recommendations per user
model.recommendForAllUsers(10).show()
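Two details in the example above are worth spelling out: what RMSE measures, and why coldStartStrategy="drop" matters. When the test set contains a user or movie that never appeared in training, ALS cannot produce a prediction and by default emits NaN. A single NaN makes the whole RMSE NaN, so "drop" removes those rows before evaluation. The plain-Python sketch below mimics that behavior on hypothetical ratings. The rmse helper and its drop_nan flag are illustrative, not MLlib API.

```python
import math


def rmse(labels, predictions, drop_nan=True):
    """Root mean squared error, optionally skipping NaN predictions."""
    pairs = list(zip(labels, predictions))
    if drop_nan:
        pairs = [(y, p) for y, p in pairs if not math.isnan(p)]
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


# Hypothetical ratings; the NaN stands for a user or movie unseen in training.
labels = [4.0, 3.0, 5.0, 2.0]
predictions = [3.5, 3.0, 4.0, float("nan")]

with_drop = rmse(labels, predictions)
without_drop = rmse(labels, predictions, drop_nan=False)
print("RMSE with NaN rows dropped:", with_drop)
print("RMSE with NaN rows kept:", without_drop)
```

RMSE is in the same units as the ratings, so a value of about 0.65 here means predictions are off by roughly two thirds of a rating point on the rows that could be scored.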
Conclusion
PySpark MLlib brings machine learning to big data, from linear regression to recommendation systems. If your datasets are too large for a single-machine library such as scikit-learn, MLlib is built for that case.