Chapter 9.9: Unsupervised or Semi-Supervised Feature Learning
1) Hand-crafted Convolution Kernels
Before modern deep learning, visual features were extracted using manually designed filters such as Sobel, Laplacian, Gaussian blur, sharpening, and emboss kernels. These filters capture simple patterns—edges, gradients, corners, and blobs—based on human intuition about what matters in images. They work reasonably well for low-level vision tasks but cannot automatically adapt to new data distributions or complex visual concepts.
2) Unsupervised Feature Learning
To avoid manually designing all filters, researchers explored unsupervised or semi-supervised methods that automatically learn useful features from unlabeled data. Approaches such as sparse coding, autoencoders, k-means feature extraction, and clustering-based templates attempted to replace hand-crafted kernels by discovering edge-like or texture-like patterns directly from raw images without labels. These methods were effective for simple patterns and could reduce the need for labeled data, but they were limited in scale, depth, and representational power.
3) Why Modern Systems Still Rely on CNNs to Learn Features
Although hand-crafted kernels and unsupervised feature learning provided meaningful progress, they cannot match the expressiveness, depth, and adaptability of convolutional neural networks. CNNs learn hierarchical features end-to-end—from simple edges to complex textures, shapes, and object parts—optimized directly for the final task. CNNs scale to large datasets, capture richer invariances, and consistently outperform manually designed or shallow unsupervised methods. As a result, feature extraction today is dominated by CNNs (or even more powerful architectures like Vision Transformers), making handcrafted filters and early unsupervised methods largely obsolete except for instructional or specialized uses.
Figure: Classic hand-crafted convolution kernels including Sobel (edge detection), Laplacian (corners and edges), Gaussian blur (smoothing), sharpening, and emboss filters. These filters were the foundation of computer vision before deep learning, capturing simple patterns based on human intuition about visual features.
Key Insight
The evolution from hand-crafted to learned features represents a fundamental shift in computer vision. Hand-crafted kernels like Sobel and Laplacian were limited to simple, predefined patterns. Unsupervised learning methods (sparse coding, autoencoders, k-means) attempted to discover features automatically but lacked depth and scalability. Modern CNNs surpass both by learning hierarchical features end-to-end—from low-level edges to high-level semantic concepts—optimized directly for the task at hand. This adaptability and representational power explain why CNNs became the dominant paradigm in visual feature extraction.