Python CGI Programming

The Concept of CGI

CGI is an abbreviation for Common Gateway Interface. It is not a programming language but a set of rules (a specification) that establishes a dynamic interaction between a web application and the client application (the browser). CGI-based programs handle the communication between the web server and the client. Whenever the client browser makes a request, it sends it to the web server, and the CGI program returns output to the web server based on the input that the client provides.

The Common Gateway Interface (CGI) provides a standard for external gateway programs to interface with information servers such as HTTP servers.

CGI programs generate web pages dynamically, either in response to input from the user or by interacting with software on the server.

The Concept of Web Browsing

Have you ever wondered how those blue-colored underlined texts, commonly known as hyperlinks, are able to take you from one web page or Uniform Resource Locator (URL) to another? What exactly happens when a user clicks on a hyperlink?

Let’s understand the concept behind web browsing. It consists of the following steps:

STEP 1: Firstly, the browser contacts the HTTP server and requests the URL.

STEP 2: The server then parses the URL.

STEP 3: After that, it checks for the specified filename.

STEP 4: Once it finds that file, it sends the file back as a response.

STEP 5: The web browser accepts the response from the web server.

STEP 6: Depending on the server’s response, the browser either displays the requested file or an error message.

However, it is possible to set up the HTTP server so that whenever a file in a specific directory is requested, that file is not sent back; instead, it is executed as a program, and the program’s output is sent back to the browser. This mechanism is known as the Common Gateway Interface, abbreviated as CGI. The programs executed this way are known as CGI scripts, and they can be C or C++ programs, shell scripts, Perl scripts, Python scripts, etc.

The working of CGI

Whenever the client makes a request to the web server, the Common Gateway Interface (CGI) handles it using external script files. These script files can be written in any language, but the chief idea is to retrieve the data quickly and efficiently. The scripts then convert the retrieved data into an HTML-formatted page and hand it back to the web server.

[Figure: architectural diagram representing the working of CGI]

Usage of cgi module

Python provides the cgi module, which consists of numerous useful core properties and functions. These can be used by importing the cgi module into the current program, as shown below:

import cgi

Now, we will use cgitb.enable() in our script to activate an exception handler that displays a detailed report in the web browser for any errors that occur. The script will look as shown below:

import cgi
import cgitb

cgitb.enable()

Alternatively, we can have the reports saved to log files instead of displayed, with the help of the following script:

import cgitb
cgitb.enable(display=0, logdir="/path/to/logdir")

The cgitb functions stated above help throughout script development. The reports they produce let the user debug the script efficiently. Once the script produces the expected result, this can be removed.

As we have discussed earlier, users can submit information with the help of a form. But how can we obtain that information? To answer this question, let’s understand the FieldStorage class of Python. If the form contains non-ASCII characters, we can apply the encoding keyword parameter; the document’s encoding is found in the <META> tag inside the <HEAD> section of the HTML file.

 The FieldStorage class is used to read the form data from the environment or the standard input.

A FieldStorage instance is similar to a Python dictionary: it supports len() and the other dictionary operations. By default it ignores fields whose values are empty strings; users can retain those blank values by setting the optional keyword parameter keep_blank_values to True.
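As a minimal sketch (the field name "name" below is illustrative), a script can treat the parsed form like a dictionary:

import cgi

# keep_blank_values=True retains fields that were submitted as empty strings
form = cgi.FieldStorage(keep_blank_values=True)

print("Content-type: text/html")
print()
print("<p>Number of fields:", len(form), "</p>")
if "name" in form:
    print("<p>The name field was submitted</p>")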

Let’s see an example:

form = cgi.FieldStorage()
if "name" not in form or "add" not in form:
    print("<H1>Input Error!!</H1>")
    print("Please enter the details in the Name and Address fields!")
else:
    print("<p>Name:", form["name"].value, "</p>")
    print("<p>Address:", form["add"].value, "</p>")

In the above snippet of code, we have utilized the form [“name”], where name is key, for extracting the value which the user enters.

To fetch a string value directly, we can use the getvalue() method. This method also accepts an optional second argument: a default value that is returned when the key is not present.
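Continuing with the form object from the example above, a hedged one-liner (the field name and default are illustrative):

# returns "Anonymous" when no "name" field was submitted
name = form.getvalue("name", "Anonymous")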

Moreover, if the submitted form has multiple fields with the same name, we should take the help of the form.getlist() method, which returns a list of strings. In the following snippet of code, we read several username fields and join them with commas.

username_values = form.getlist("username")
f_username = ",".join(username_values)

If we want to access the field where a file is uploaded and read that in bytes, we can use the value attribute or the getvalue() method. Let’s see the following snippet of code if the user uploads the file.

file_item = form["userfile"]   
if (fileitem.file):      
    # It represents the uploaded file     
    count_line = 0       
    while(True):           
        line = fileitem.file.readline()           if not line: break           
        count_line = count_line + 1  

An error can interrupt the program while it is reading the content of an uploaded file, for example when the user clicks the Back or Cancel button. In that case, the FieldStorage class sets its done attribute to -1.

Furthermore, the items will be objects of the MiniFieldStorage class if the submitted form is in the “old” format. In this class, the list, filename, and file attributes are always None.

If a form is submitted with the POST method and also contains a query string, the resulting items will be a mix of FieldStorage and MiniFieldStorage objects.

Let’s see a list of the FieldStorage attributes in the following table.

FieldStorage Attributes:

S. No. | Attribute | Description
1 | name     | The field name, if specified.
2 | file     | A file(-like) object from which the data can be read as bytes.
3 | filename | The filename at the client side, if specified.
4 | type     | The content type of the field.
5 | value    | The value of the field as a string; for an uploaded file, it holds the file contents as bytes.
6 | headers  | A dictionary-like object containing all the headers.

In addition to the above, the FieldStorage instance uses various core methods for manipulating users’ data. Some of them are listed below:

FieldStorage Methods:

S. No. | Method | Description
1 | getfirst()  | Returns the first value received for the given field.
2 | getvalue()  | Works like the dictionary get() method.
3 | getlist()   | Returns the list of values received for the given field.
4 | keys()      | Works like the dictionary keys() method.
5 | make_file() | Returns a readable and writable file object.

CGI Program Structure in Python

Let’s understand the structure of a Python CGI Program:

  • The output of a Python CGI script must consist of two sections separated by a blank line.
  • The first section contains the headers that describe to the client the type of data that follows, and the second section contains the data that will be displayed when the script runs.

Let’s have a look at the Python code given below:

print ("Content-type : text/html") 
# now enter the rest html document print ("<html>") 
print ("<head>") 
print ("<title> Welcome to CGI program </title>") 
print ("<head>") 
print ("<body>") 
print ("<h2> Hello World! This is my first CGI program. </h2>") print ("</body>") 
print ("</html>")

Now, let’s save the above file as hello.py (avoid the name cgi.py, which would shadow the standard cgi module). Once we execute the file through the web server, we should see the output shown below:

Hello World! This is my first CGI program.

The above program is a simple Python script that writes its output to STDOUT, i.e., on-screen.

Understanding the HTTP Header

There are various HTTP headers defined that are frequently used in the CGI programs. Some of them are listed below:

S. No. | HTTP Header | Description
1 | Content-type        | A MIME string defining the format of the file being returned.
2 | Content-length: N   | The length, in bytes, of the data being returned; browsers use it to report the estimated download time for a file.
3 | Expires: Date       | The date until which the returned information is valid.
4 | Last-modified: Date | The date of the resource’s last modification.
5 | Location: URL       | A URL that is returned by the server instead of the one requested.
6 | Set-Cookie: String  | Sets a cookie through the given string.

The CGI Environment Variables

There are some variables predefined in the CGI environment alongside the HTML syntax. Some of them are listed in the following table:

S. No. | Environment Variable | Description
1 | CONTENT_TYPE    | Describes the data type of the content.
2 | CONTENT_LENGTH  | Defines the length of the query information.
3 | HTTP_COOKIE     | Returns the cookies set by the user in the current session.
4 | HTTP_USER_AGENT | Identifies the type of browser the user is currently using.
5 | REMOTE_HOST     | The host name of the visitor.
6 | PATH_INFO       | The path of the CGI script.
7 | REMOTE_ADDR     | The IP address of the visitor.
8 | REQUEST_METHOD  | The method used to make the request, typically GET or POST.
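As a small sketch of reading these variables (whether each one is set depends on the web server), a CGI script can inspect os.environ:

import os

print("Content-type: text/html")
print()
# print a few common CGI environment variables, if the server provided them
for var in ("REQUEST_METHOD", "REMOTE_ADDR", "HTTP_USER_AGENT"):
    print("<p>{}: {}</p>".format(var, os.environ.get(var, "not set")))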

How to Debug CGI Scripts?

The cgi module provides a test() function that is useful for debugging: it writes minimal HTTP headers and dumps all the information supplied to the script in HTML form. It can be invoked with a single statement:

cgi.test()
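A minimal debugging script built around this call might look as follows (a sketch; the interpreter path in the first line depends on the server’s setup):

#!/usr/bin/env python3
import cgi
import cgitb

cgitb.enable()  # show detailed tracebacks in the browser while debugging
cgi.test()      # writes minimal headers and dumps all CGI information as HTML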

Pros and Cons of CGI Programming

Some Pros of CGI Programming:

There are numerous pros of using CGI programming. Some of them are as follows:

  • CGI programs are language-agnostic: they can be written in any programming language.
  • CGI programs are portable and can work on almost any web server.
  • CGI programs are quite scalable and can perform any task, whether simple or complex.
  • CGI programs take comparatively little time to process simple requests.
  • Using CGI in development can reduce the cost of development and maintenance, making it profitable.
  • CGI can be used to increase dynamic communication in web applications.

Some Cons of CGI Programming:

There are a few cons of using CGI programming. Some of them are as follows:

  • CGI programs are fairly complex, which makes them harder to debug.
  • The interpreter has to evaluate the CGI script on every invocation, so many simultaneous client requests create a heavy load on the server.
  • CGI programs can be quite vulnerable, as many are freely available and run without server-side security.
  • Each CGI request spawns a new process, which makes processing time-consuming under load.
  • Page data is not cached in memory between page loads.
  • CGI applications often have huge, extensive codebases, mostly in Perl.
]]>
4312
Python Regex Functions

A regular expression is a set of characters with highly specialized syntax that we can use to find or match other characters or groups of characters. Regular expressions, or regex for short, are widely used in the UNIX world.

Import the re Module

# Importing re module
import re

The re module in Python gives full support for Perl-style regular expressions. The re module raises the re.error exception whenever an error occurs while compiling or using a regular expression.

We’ll go over crucial functions utilized to deal with regular expressions.

But first, a minor point: many letters have a particular meaning when used in a regular expression; these are called metacharacters.

Most symbols and characters simply match themselves. For example, the regular expression ‘check’ will match exactly the string ‘check’. (A case-insensitive mode can be enabled, allowing this RE to also match ‘Check’ or ‘CHECK’.)

There are some exceptions to this general rule; certain symbols are special metacharacters that don’t match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Metacharacters or Special Characters

As the name suggests, there are some characters with special meanings:

Character | Meaning
.  | Dot – matches any character except the newline character.
^  | Caret – matches the pattern at the start of the string. (Starts with)
$  | Dollar – matches at the end of the string, before the final newline character. (Ends with)
*  | Asterisk – matches zero or more occurrences of a pattern.
+  | Plus – matches one or more occurrences of a pattern.
?  | Question mark – matches zero or one occurrence of a pattern.
{} | Curly braces – matches exactly the specified number of occurrences of a pattern.
[] | Brackets – define a set of characters.
|  | Pipe – matches either of two patterns.
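As a quick illustrative demo of a few of these metacharacters (the sample strings are arbitrary):

import re

print(re.findall(r'c.t', 'cat cot ct'))       # ['cat', 'cot'] - dot matches any single character
print(re.findall(r'^Learn', 'Learn Python'))  # ['Learn'] - caret anchors at the start
print(re.findall(r'a{2}', 'aa aaa'))          # ['aa', 'aa'] - exactly two occurrences
print(re.findall(r'cat|dog', 'cat and dog'))  # ['cat', 'dog'] - pipe matches either pattern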

Special Sequences:

The ability to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with string methods. On the other hand, regexes wouldn’t be much of an improvement if that were their only extra capability. We can also specify that some portions of the RE must be repeated a certain number of times.

The first metacharacter we’ll examine for repetition is *. Rather than matching the literal character ‘*’, * signals that the preceding character can be matched zero or more times instead of exactly once.

For example, ba*t matches ‘bt’ (zero ‘a’ characters), ‘bat’ (one ‘a’ character), ‘baaat’ (three ‘a’ characters), and so on.

Greedy repetitions, such as *, cause the matching algorithm to try to repeat the RE as many times as feasible. If later elements of the sequence fail to match, the matching algorithm retries with fewer repetitions.

Special Sequences consist of ‘\’ followed by a character listed below. Each character has a different meaning.

Character | Meaning
\d | Matches any digit; equivalent to [0-9].
\D | Matches any non-digit character; equivalent to [^0-9].
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S | Matches any character except whitespace; equivalent to [^ \t\n\r\f\v].
\w | Matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_].
\W | Matches any character except alphanumerics and underscore; equivalent to [^a-zA-Z0-9_].
\A | Matches the pattern only at the start of the string.
\b | r"\bxt" matches the pattern at the beginning of a word in a string; r"xt\b" matches it at the end of a word.
\B | The opposite of \b.
\Z | Matches the pattern only at the end of the string.
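A short illustrative demo of some of these sequences (the sample strings are arbitrary):

import re

print(re.findall(r'\d+', 'Python 3.10, 2021'))  # ['3', '10', '2021'] - runs of digits
print(re.findall(r'\w+', 'hi, py3!'))           # ['hi', 'py3'] - runs of word characters
print(re.findall(r'\s', 'a b\tc'))              # [' ', '\t'] - whitespace characters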

RegEx Functions:

  • compile – Turns a regex pattern into a regular expression object that can be used in a number of ways for matching patterns in a string.
  • search – Finds the first occurrence of a regex pattern in a given string.
  • match – Matches the pattern only at the beginning of the string.
  • fullmatch – Matches the whole string against a regex pattern.
  • split – Splits the string by occurrences of the regex pattern.
  • findall – Finds all non-overlapping matches of a pattern in a string and returns them as a list.
  • finditer – Returns an iterator that yields match objects.
  • sub – Returns a string after substituting occurrences of the pattern with the replacement (all of them, unless count is given).
  • subn – Works the same as sub but returns a tuple (new_string, num_of_substitutions).
  • escape – Escapes special characters in a pattern.
  • purge – Clears the regular expression cache.

1. re.compile(pattern, flags=0)

It is used to create a regular expression object that can be used to match patterns in a string.

Example:

# Importing re module  
import re  

# Defining regEx pattern  
pattern = "amazing"  

# Creating a regEx object
regex_object = re.compile(pattern)  

# String  
text = "This tutorial is amazing!"   

# Searching for the pattern in the string  
match_object = regex_object.search(text)  

# Output  
print("Match Object:", match_object)  


Output:
Match Object: <re.Match object; span=(17, 24), match='amazing'>

This is equivalent to:

re_obj = re.compile(pattern)
result = re_obj.search(string)
# which is the same as:
result = re.search(pattern, string)

Note – When a regular expression object is used several times, the re.compile() version of the program is much more efficient.

2. re.match(pattern, string, flags=0)

  • It starts matching the pattern from the beginning of the string.
  • Returns a match object if any match is found with information like start, end, span, etc.
  • Returns None in case no match is found.

Parameters

  • pattern: the expression to be matched; it must be a regular expression.
  • string: the string that will be compared against the pattern at its start.
  • flags: multiple flags can be combined using bitwise OR (|).

Example:

# Importing re module  
import re  

# Our pattern  
pattern = "hello"  

# Returns a match object if found else Null  
match = re.match(pattern, "hello world")  

print(match) # Printing the match object  
print("Span:", match.span()) # Return the tuple (start, end)  
print("Start:", match.start()) # Return the starting index  
print("End:", match.end()) # Returns the ending index  

Output: <re.Match object; span=(0, 5), match='hello'>
Span: (0, 5) 
Start: 0 
End: 5

Another example of the implementation of the re.match() method in Python.

  • The expressions “.w*” and “.w*?” are meant to match words containing the letter “w”; anything that does not have the letter “w” will be ignored.
  • A for loop can be used in a re.match() illustration like this to inspect for matches for every element in a list of words.

CODE:

import re    
line = "Learn Python through tutorials on shishirkant"  
match_object = re.match( r'.w* (.w?) (.w*?)', line, re.M|re.I)

if match_object:    
    print ("match object group : ", match_object.group())   
    print ("match object 1 group : ", match_object.group(1))
    print ("match object 2 group : ", match_object.group(2))  
else:    
    print ( "There isn't any match!!" )   

Output:
There isn't any match!!

3. re.search(pattern, string, flags=0)

The re.search() function looks for the first occurrence of a regular expression pattern and returns it. Unlike Python’s re.match(), it checks all lines of the supplied string. If the pattern is matched, re.search() produces a match object; otherwise, it returns None.

To execute the search() function, we must first import the re module and then run the program. The re.search() call takes the “pattern” and the “text” to check against our primary string.

Here is the description of the parameters –

pattern: the expression to be matched; it must be a regular expression.

string: the string that will be searched for the pattern anywhere within it.

flags: multiple flags can be combined using bitwise OR (|). These act as modifiers.

Code

import re  

line = "Learn Python through tutorials on shishirkant";  

search_object = re.search( r' .*t? (.*t?) (.*t?)', line) 
if search_object:  
    print("search object group : ", search_object.group())  
    print("search object group 1 : ", search_object.group(1)) 
    print("search object group 2 : ", search_object.group(2)) 
else:  
    print("Nothing found!!")  

Output:
search object group : Python through tutorials on shishirkant 
search object group 1 : on 
search object group 2 : shishirkant

4. re.sub(pattern, repl, string, count=0, flags=0)

  • It substitutes matches of the pattern with ‘repl’ in the string.
  • pattern – the regex pattern to be matched.
  • repl – stands for “replacement”; it replaces the pattern in the string.
  • count – controls the maximum number of substitutions.

Example 1:

# Importing re module  
import re  

# Defining parameters  
pattern = "like" # to be replaced  
repl = "love" # Replacement  
text = "I like Shishirkant!" # String 

# Returns a new string with a substituted pattern 
new_text = re.sub(pattern, repl, text)  

# Output  
print("Original text:", text)  
print("Substituted text: ", new_text)  

Output:
Original text: I like Shishirkant! 
Substituted text: I love Shishirkant!

In the above example, the sub function replaces ‘like’ with ‘love’.

Example 2 – Substituting 3 occurrences of a pattern.

# Importing re package  
import re  

# Defining parameters  
pattern = "l" # to be replaced  
repl = "L" # Replacement  
text = "I like Shishirkant! I also like tutorials!" # String  

# Returns a new string with the substituted pattern  
new_text = re.sub(pattern, repl, text, 3)  

# Output  
print("Original text:", text)  
print("Substituted text:", new_text)  

Output:
Original text: I like Shishirkant! I also like tutorials! 
Substituted text: I Like Shishirkant! I aLso Like tutorials!

Here, the first three occurrences of ‘l’ are substituted with ‘L’.

5. re.subn(pattern, repl, string, count=0, flags=0)

  • The working of subn is the same as the sub function.
  • It returns a tuple (new_string, num_of_substitutions).

Example:

# Importing re module  
import re  

# Defining parameters  
pattern = "l" # to be replaced  
repl = "L" # Replacement  
text = "I like Shishirkant! I also like tutorials!" # String  

# Returns a new string with the substituted pattern  
new_text = re.subn(pattern, repl, text, 3)  

# Output  
print("Original text:", text)  
print("Substituted text:", new_text) 

Output:
Original text: I like Shishirkant! I also like tutorials! 
Substituted text: ('I Like Shishirkant! I aLso Like tutorials!', 3)

In the above program, the subn function replaces the first three occurrences of ‘l’ with ‘L’ in the string.

6. re.fullmatch(pattern, string, flags=0)

  • It matches the whole string against the pattern.
  • Returns the corresponding match object.
  • Returns None in case no match is found.
  • By contrast, the search() function only finds the first occurrence that matches the pattern.

Example:

# Importing re module
import re

# Sample string
line = "Hello world"

# Using re.fullmatch()
print(re.fullmatch("Hello", line))
print(re.fullmatch("Hello world", line))

Output:

None
<re.Match object; span=(0, 11), match='Hello world'>

In the above program, only "Hello world" has completely matched the pattern, not "Hello".

Q. When to use re.findall()?

Ans. Suppose we have a line of text and want to get all of the occurrences of a pattern from the content; then we use Python’s re.findall() function. It searches the entire content provided to it.
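For example (a small illustrative snippet):

import re

s = "Python 3.10 was released on October 04, 2021"

# findall returns every non-overlapping match in the string
print(re.findall(r'\d+', s))  # ['3', '10', '04', '2021']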

7. re.finditer(pattern, string, flags=0)

  • Returns an iterator that yields all non-overlapping matches of the pattern in a string.
  • The string is scanned from left to right.
  • Matches are returned in the order they were found.

# Importing re module
import re

# Sample string
line = "Hello world. I am Here!"

# Regex pattern
pattern = r'[aeiou]'

# Using re.finditer()
iter_ = re.finditer(pattern, line)

# Iterating over iter_
for i in iter_:
    print(i)

Output:

<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(7, 8), match='o'>
<re.Match object; span=(15, 16), match='a'>
<re.Match object; span=(19, 20), match='e'>
<re.Match object; span=(21, 22), match='e'>

8. re.split(pattern, string, maxsplit=0, flags=0)

  • It splits the string by occurrences of the pattern.
  • If maxsplit is zero, all possible splits occur.
  • If maxsplit is one, the string is split at the first occurrence of the pattern, and the remainder is returned as the final element.

Example:

# Import re module
import re

# Pattern
pattern = ' '

# Sample string
line = "Learn Python through tutorials on shishirkant"

# Using split function to split the string at ' '
result = re.split(pattern, line)

# Printing the result
print("When maxsplit = 0, result:", result)

# When maxsplit is one
result = re.split(pattern, line, maxsplit=1)
print("When maxsplit = 1, result =", result)

Output:
When maxsplit = 0, result: ['Learn', 'Python', 'through', 'tutorials', 'on', 'shishirkant']
When maxsplit = 1, result = ['Learn', 'Python through tutorials on shishirkant']

9. re.escape(pattern)

  • It escapes the special characters in the pattern.
  • The escape function becomes important when the string contains regular expression metacharacters.

Example:

# Import re module
import re

# Pattern
pattern = 'https://www.shishirkant.com/'

# Using escape function to escape metacharacters
result = re.escape(pattern)

# Printing the result
print("Result:", result)

Output:
Result: https://www\.shishirkant\.com/

The escape function escapes the metacharacter ‘.’ in the pattern. This is useful when we want to treat metacharacters as regular characters so that they match themselves literally.

10. re.purge()

  • The purge function takes no arguments; it simply clears the regular expression cache.

Example:

# Importing re module
import re

# Define some regular expressions
pattern1 = r'\d+'
pattern2 = r'[a-z]+'

# Use the regular expressions
print(re.search(pattern1, '123abc'))
print(re.search(pattern2, '123abc'))

# Clear the regular expression cache
re.purge()

# Use the regular expressions again
print(re.search(pattern1, '456def'))
print(re.search(pattern2, '456def'))

Output:

<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(0, 3), match='456'>
<re.Match object; span=(3, 6), match='def'>

  • First, pattern1 and pattern2 are used to search for matches in the string '123abc'.
  • Then the cache is cleared using re.purge().
  • Pattern1 and pattern2 are used again to search for matches in the string '456def'.
  • Since the regular expression cache has been cleared, the regular expressions are recompiled, and the searches in '456def' are performed with new regular expression objects.

Matching Versus Searching – re.match() vs. re.search()

Python has two primary regular expression functions: match and search. The match function looks for a match only at the start of the string, whereas the search function looks for a match anywhere in the string.

CODE:

# Import re module
import re

# Sample string
line = "Learn Python through tutorials on shishirkant"

# Using match function to match 'through'
match_object = re.match(r'through', line, re.M|re.I)
if match_object:
    print("match object group : ", match_object)
else:
    print("There isn't any match!!")

# Using search function to search
search_object = re.search(r'through', line, re.M|re.I)
if search_object:
    print("Search object group : ", search_object)
else:
    print("Nothing found!!")

Output:
There isn't any match!!
Search object group :  <re.Match object; span=(13, 20), match='through'>

The match function checks whether the string starts with ‘through’, while the search function checks whether ‘through’ occurs anywhere in the string.

          ]]>
          4308
Python Regular Expressions – I

Introduction to Python regular expressions

          Regular expressions (called regex or regexp) specify search patterns. Typical examples of regular expressions are the patterns for matching email addresses, phone numbers, and credit card numbers.

          Regular expressions are essentially a specialized programming language embedded in Python. And you can interact with regular expressions via the built-in re module in Python.

          The following shows an example of a simple regular expression:

          '\d'


          In this example, a regular expression is a string that contains a search pattern. The '\d' is a digit character set that matches any single digit from 0 to 9.

          Note that you’ll learn how to construct more complex and advanced patterns in the next tutorials. This tutorial focuses on the functions that deal with regular expressions.

          To use this regular expression, you follow these steps:

          First, import the re module:

          import re

          Second, compile the regular expression into a Pattern object:

          p = re.compile('\d')

          Third, use one of the methods of the Pattern object to match a string:

          s = "Python 3.10 was released on October 04, 2021" 
          result = p.findall(s) 
          
          print(result)

          Output:

          ['3', '1', '0', '0', '4', '2', '0', '2', '1']

          The findall() method returns a list of single digits in the string s.

          The following shows the complete program:

          import re 
          
          p = re.compile('\d') 
          s = "Python 3.10 was released on October 04, 2021" 
          
          results = p.findall(s) 
          print(results)

          Besides the findall() method, the Pattern object has other essential methods that allow you to match a string:

Method | Purpose
match()    | Find the pattern at the beginning of a string
search()   | Return the first match of a pattern in a string
findall()  | Return all matches of a pattern in a string
finditer() | Return all matches of a pattern as an iterator
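A small illustrative demo of these four methods on one Pattern object:

import re

p = re.compile(r'\d+')
s = '3 apples and 12 oranges'

print(p.match(s))    # <re.Match object; span=(0, 1), match='3'> - only at the start
print(p.search(s))   # first match anywhere: '3'
print(p.findall(s))  # ['3', '12']
print([m.group() for m in p.finditer(s)])  # ['3', '12']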

          Python regular expression functions

          Besides the Pattern class, the re module has some functions that match a string for a pattern:

          • match()
          • search()
          • findall()
          • finditer()

          These functions have the same names as the methods of the Pattern object. Also, they take the same arguments as the corresponding methods of the Pattern object. However, you don’t have to manually compile the regular expression before using it.

          The following example shows the same program that uses the findall() function instead of the findall() method of a Pattern object:

          import re 
          
          s = "Python 3.10 was released on October 04, 2021." 
          results = re.findall('\d',s) 
          print(results)

          Using the functions in the re module is more concise than the methods of the Pattern object because you don’t have to compile regular expressions manually.

          Under the hood, these functions create a Pattern object and call the appropriate method on it. They also store the compiled regular expression in a cache for speed optimization.

          It means that if you call the same regular expression from the second time, these functions will not need to recompile the regular expression. Instead, they get the compiled regular expression from the cache.

          Should you use the re functions or methods of the Pattern object?

          If you use a regular expression within a loop, the Pattern object may save a few function calls. However, if you use it outside of loops, the difference is very little due to the internal cache.
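For instance, a small sketch of the loop case (the pattern and data are illustrative):

import re

emails = ['a@x.com', 'b@y.org', 'not-an-email']

# compiling once outside the loop avoids a cache lookup on every iteration
pattern = re.compile(r'\w+@\w+\.\w+')
valid = [s for s in emails if pattern.fullmatch(s)]
print(valid)  # ['a@x.com', 'b@y.org']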

The following sections discuss the most commonly used functions in the re module, including search(), match(), and fullmatch().

          search() function

          The search() function searches for a pattern within a string. If there is a match, it returns the first Match object or None otherwise. For example:

          import re 
          
          s = "Python 3.10 was released on October 04, 2021." 
          pattern = '\d{2}' 
          match = re.search(pattern, s) 
          print(type(match)) 
          print(match)
          Output:<class 're.Match'> 
          <re.Match object; span=(9, 11), match='10'>

          In this example, the search() function returns the first two digits in the string s as the Match object.

          Match object

          The Match object provides the information about the matched string. It has the following important methods:

Method | Description
group() | Return the matched string
start() | Return the starting position of the match
end()   | Return the ending position of the match
span()  | Return a tuple (start, end) that specifies the positions of the match

          The following example examines the Match object:

          import re 
          
          s = "Python 3.10 was released on October 04, 2021." 
          result = re.search('\d', s) 
          
          print('Matched string:',result.group()) 
          print('Starting position:', result.start()) 
print('Ending position:', result.end())
print('Positions:', result.span())

          Output:

          Matched string: 3 
          Starting position: 7 
          Ending position: 8 
          Positions: (7, 8)

          match() function

          The match() function returns a Match object if it finds a pattern at the beginning of a string. For example:

          import re 
          
          l = ['Python', 
               'CPython is an implementation of Python written in C', 
               'Jython is a Java implementation of Python', 
                'IronPython is Python on .NET framework'] 
          
          pattern = '\wython' 
          for s in l: 
              result = re.match(pattern,s) 
              print(result)

          Output:

          <re.Match object; span=(0, 6), match='Python'> 
          None 
          <re.Match object; span=(0, 6), match='Jython'> 
          None

In this example, the \w is the word character set that matches any single word character (a letter, digit, or underscore).

The \wython matches any string that starts with any single word character followed by the literal string ython, for example, Python.

          Since the match() function only finds the pattern at the beginning of a string, the following strings match the pattern:

          Python 
          Jython is a Java implementation of Python

          And the following string doesn’t match:

          'CPython is an implementation of Python written in C' 
          'IronPython is Python on .NET framework'

          fullmatch() function

          The fullmatch() function returns a Match object if the whole string matches a pattern or None otherwise. The following example uses the fullmatch() function to match a string with four digits:

          import re 
          
          s = "2021" 
          pattern = '\d{4}' 
          result = re.fullmatch(pattern, s) 
          print(result)

          Output:

          <re.Match object; span=(0, 4), match='2019'>(python)

The pattern '\d{4}' matches a string of four digits. Therefore, the fullmatch() function returns a match for the string 2021.

          If you place the number 2021 at the middle or the end of the string, the fullmatch() will return None. For example:

          import re 
          
          s = "Python 3.10 released in 2021" 
          pattern = '\d{4}' 
          result = re.fullmatch(pattern, s) 
          print(result)

          Output:

          None

          Regular expressions and raw strings

          It’s important to note that Python and regular expression are different programming languages. They have their own syntaxes.

          The re module is the interface between Python and regular expression programming languages. It behaves like an interpreter between them.

To construct a pattern, regular expressions often use a backslash '\', for example \d and \w. But this collides with Python’s usage of the backslash for the same purpose in string literals.

          For example, suppose you need to match the following string:

          s = '\section'

          In Python, the backslash (\) is a special character. To construct a regular expression, you need to escape any backslashes by preceding each of them with a backslash (\):

pattern = '\\section'

          In regular expressions, the pattern must be '\\section'. However, to express this pattern in a string literal in Python, you need to use two more backslashes to escape both backslashes again:

pattern = '\\\\section'

          Simply put, to match a literal backslash ('\'), you have to write '\\\\' because the regular expression must be '\\' and each backslash must be expressed as '\\' inside a string literal in Python.

          This results in lots of repeated backslashes. Hence, it makes the regular expressions difficult to read and understand.

          A solution is to use the raw strings in Python for regular expressions because raw strings treat the backslash (\) as a literal character, not a special character.

          To turn a regular string into a raw string, you prefix it with the letter r or R. For example:

          import re 
          
          s = '\section' 
          pattern = r'\\section' 
          result = re.findall(pattern, s) 
          
          print(result) 

Output:
['\\section']

Note that in Python '\section' and '\\section' are the same:

          p1 = '\\section' 
          p2 = '\section' 
          
          print(p1==p2) # true

          In practice, you’ll find the regular expressions constructed in Python using the raw strings.

          Summary

          • A regular expression is a string that contains the special characters for matching a string with a pattern.
          • Use the Pattern object or functions in re module to search for a pattern in a string.
          • Use raw strings to construct regular expression to avoid escaping the backslashes.
          ]]>
          4305
Text Processing in Machine Learning

Text processing is one of the most common tasks in many ML applications. Below are some examples of such applications:

  • Language Translation: Translation of a sentence from one language to another.
  • Sentiment Analysis: To determine, from a text corpus, whether the sentiment towards any topic or product, etc., is positive, negative, or neutral.
  • Spam Filtering: Detect unsolicited and unwanted email/messages.

          These applications deal with huge amount of text to perform classification or translation and involves a lot of work on the back end. Transforming text into something an algorithm can digest is a complicated process. In this article, we will discuss the steps involved in text processing.

Step 1: Data Preprocessing

  • Tokenization — convert sentences to words.
  • Removing unnecessary punctuation and tags.
  • Removing stop words — frequent words such as "the", "is", etc. that do not carry specific semantics.
  • Stemming — words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix.
  • Lemmatization — another approach to remove inflection, by determining the part of speech and utilizing a detailed database of the language.

The stemmed form of studies is: studi
The stemmed form of studying is: study

The lemmatized form of studies is: study
The lemmatized form of studying is: study

          Thus stemming & lemmatization help reduce words like ‘studies’, ‘studying’ to a common base form or root word ‘study’.

Note that not all the steps are mandatory; their use depends on the application use case. For spam filtering we may follow all the above steps, but we may not for a language translation problem.

          We can use python to do many text preprocessing operations.

  • NLTK — The Natural Language Toolkit is one of the best-known and most-used NLP libraries, useful for all sorts of tasks from tokenization and stemming to tagging, parsing, and beyond.
  • BeautifulSoup — Library for extracting data from HTML and XML documents.

# using the NLTK library, we can do a lot of text preprocessing
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop word lists used below

# split text into words
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(tokens)

OUT: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
print(tokens)

OUT: ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# NLTK provides several stemmer interfaces like Porter Stemmer,
# Lancaster Stemmer, and Snowball Stemmer
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stems = []
for t in tokens:
    stems.append(porter.stem(t))
print(stems)

OUT: ['the', 'quick', 'brown', 'fox', 'jump', 'lazi', 'dog']

          Step 2: Feature Extraction

          In text processing, words of the text represent discrete, categorical features. How do we encode such data in a way which is ready to be used by the algorithms? The mapping from textual data to real valued vectors is called feature extraction. One of the simplest techniques to numerically represent text is Bag of Words.

Bag of Words (BoW): We make a list of the unique words in the text corpus, called the vocabulary. Then we can represent each sentence or document as a vector, with each word represented as 1 if it is present in the vocabulary and 0 if it is absent. Another representation counts the number of times each word appears in a document. The most popular approach of this kind is the Term Frequency-Inverse Document Frequency (TF-IDF) technique.
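A minimal bag-of-words sketch with scikit-learn (the two-document corpus is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["It is a beautiful day", "The day is bright"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # per-document word counts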

          • Term Frequency (TF) = (Number of times term t appears in a document)/(Number of terms in the document)
          • Inverse Document Frequency (IDF) = log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low. Thus having the effect of highlighting words that are distinct.
          • We calculate TF-IDF value of a term as = TF * IDF

          Let us take an example to calculate TF-IDF of a term in a document.

          Example text corpus
          TF('beautiful',Document1) = 2/10, IDF('beautiful')=log(2/2) = 0
          TF(‘day’,Document1) = 5/10, IDF(‘day’)=log(2/1) = 0.30

          TF-IDF(‘beautiful’, Document1) = (2/10)*0 = 0
          TF-IDF(‘day’, Document1) = (5/10)*0.30 = 0.15

As you can see, for Document1 the TF-IDF method heavily penalizes the word ‘beautiful’ but assigns greater weight to ‘day’. This is due to the IDF part, which gives more weight to words that are distinct. In other words, ‘day’ is an important word for Document1 in the context of the entire corpus. The Python scikit-learn library provides efficient tools for text data mining, including functions to calculate the TF-IDF of a text vocabulary given a text corpus.
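A hedged sketch of this with scikit-learn (the corpus is illustrative; note that sklearn computes a smoothed variant of the TF-IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["It is a beautiful day", "The day is sunny"]  # hypothetical two-document corpus
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one row of TF-IDF weights per document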

          One of the major disadvantages of using BOW is that it discards word order thereby ignoring the context and in turn meaning of words in the document. For natural language processing (NLP) maintaining the context of the words is of utmost importance. To solve this problem we use another approach called Word Embedding.

          Word Embedding: It is a representation of text where words that have the same meaning have a similar representation. In other words it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together.

          Let us discuss some of the well known models of word embedding:

          Word2Vec

          Word2vec takes as its input a large corpus of text and produces a vector space with each unique word being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. Word2Vec is very famous at capturing meaning and demonstrating it on tasks like calculating analogy questions of the form a is to b as c is to ?. For example, man is to woman as uncle is to ? (aunt) using a simple vector offset method based on cosine distance. For example, here are vector offsets for three word pairs illustrating the gender relation:

          vector offsets for gender relation

This kind of vector composition also lets us answer the question “King — Man + Woman = ?” and arrive at the result “Queen”! All of which is truly remarkable when you think that all of this knowledge simply comes from looking at lots of words in context, with no other information provided about their semantics.

          Glove

          The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the word2vec method for efficiently learning word vectors. GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may result in generally better word embeddings.

          Consider the following example:

          Target words: ice, steam
          Probe words: solid, gas, water, fashion

Let P(k|w) be the probability that the word k appears in the context of word w. Consider a word strongly related to ice, but not to steam, such as solid: P(solid | ice) will be relatively high, and P(solid | steam) will be relatively low. Thus the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will instead be small. For a word related to both ice and steam, such as water, we expect the ratio to be close to one.

          Word embeddings encode each word into a vector that captures some sort of relation and similarity between words within the text corpus. This means even the variations of words like case, spelling, punctuation, and so on will be automatically learned. In turn, this can mean that some of the text cleaning steps described above may no longer be required.

          Step 3: Choosing ML Algorithms

          There are various approaches to building ML models for various text based applications depending on what is the problem space and data available.

Classical ML approaches like Naive Bayes or Support Vector Machines have been widely used for spam filtering. Deep learning techniques give better results for NLP problems like sentiment analysis and language translation, but deep learning models are very slow to train, and for simple text classification problems classical ML approaches often give similar results with quicker training time.

          Let us build a Sentiment Analyzer over the IMDB movie review dataset using the techniques discussed so far.

          Preprocessing

          The dataset is structured as test set and training set of 25000 files each. Let us first read the files into a python dataframe for further processing and visualization. The test and training set are further divided into 12500 ‘positive’ and ‘negative’ reviews each. We read each file and label negative review as ‘0’ and positive review as ‘1’

# convert the dataset from files to a python DataFrame
import os
import pandas as pd

folder = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
df = pd.DataFrame()
for f in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(folder, f, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            df = df.append([[txt, labels[l]]], ignore_index=True)
df.columns = ['review', 'sentiment']

          Let us save the assembled data as .csv file for further use.

          Five reviews and the corresponding sentiment

          To get the frequency distribution of the words in the text, we can utilize the nltk.FreqDist() function, which lists the top words used in the text, providing a rough idea of the main topic in the text data, as shown in the following code:

import nltk
from nltk.tokenize import word_tokenize

reviews = df.review.str.cat(sep=' ')

# split text into words
tokens = word_tokenize(reviews)
vocabulary = set(tokens)
print(len(vocabulary))

frequency_dist = nltk.FreqDist(tokens)
sorted(frequency_dist, key=frequency_dist.__getitem__, reverse=True)[0:50]

          This gives the top 50 words used in the text, though it is obvious that some of the stop words, such as the, frequently occur in the English language.

          Top 50 words

Look closely and you find a lot of unnecessary punctuation and tags. By excluding single- and two-letter words, stop words like the, this, and, and that take the top slots in the word frequency distribution plot shown below.

          Let us remove the stop words to further cleanup the text corpus.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]

Top 50 words

This looks like a cleaned text corpus now, with words like went, saw, and movie taking the top slots as expected.

Another helpful visualization tool, the wordcloud package, helps create word clouds by placing words on a canvas randomly, with sizes proportional to their frequency in the text.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud().generate_from_frequencies(frequency_dist)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

          Building a Classifier

After cleanup, it is time to build the classifier to identify the sentiment of each movie review. From the IMDb dataset, we divide the data into test and training sets of 25000 reviews each:

          X_train = df.loc[:24999, 'review'].values
          y_train = df.loc[:24999, 'sentiment'].values
          X_test = df.loc[25000:, 'review'].values
          y_test = df.loc[25000:, 'sentiment'].values

scikit-learn provides some cool tools to preprocess text. We use TfidfVectorizer to convert the text corpus into feature vectors, restricting the maximum number of features to 10000.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
print(train_vectors.shape, test_vectors.shape)

Training and Test set: 25K with 10K Features

          There are many algorithms to choose from, we will use a basic Naive Bayes Classifier and train the model on the training set.

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(train_vectors, y_train)

          Our Sentiment Analyzer is ready and trained. Now let us test the performance of our model on the test set to predict the sentiment labels.

from sklearn.metrics import accuracy_score

predicted = clf.predict(test_vectors)
print(accuracy_score(y_test, predicted))

Output: 0.791

Wow!!! The basic NB-classifier-based Sentiment Analyzer does well, giving around 79% accuracy. You can try changing the feature vector length and varying the parameters of TfidfVectorizer to see the impact on the accuracy of the model.

Conclusion: We have discussed the text processing techniques used in NLP in detail. We also demonstrated the use of text processing by building a Sentiment Analyzer with a classical ML approach that achieved fairly good results.

          ]]>
          2416
Dendrogram Method in Python

Important Terms in Hierarchical Clustering

          Linkage Methods

Suppose there are |a| original observations a[0], …, a[|a|−1] in cluster a and |b| original objects b[0], …, b[|b|−1] in cluster b; in order to combine these clusters, we need to calculate the distance between clusters a and b. Say a point d exists that hasn’t been allocated to any of the clusters; we need to compute the distance between cluster a and d and between cluster b and d.

          Now clusters usually have multiple points in them that require a different approach for the distance matrix calculation. Linkage decides how the distance between clusters, or point to cluster distance is computed. Commonly used linkage mechanisms are outlined below:

          1. Single Linkage — Distances between the most similar members for each pair of clusters are calculated and then clusters are merged based on the shortest distance
          2. Average Linkage — Distance between all members of one cluster is calculated to all other members in a different cluster. The average of these distances is then utilized to decide which clusters will merge
          3. Complete Linkage — Distances between the most dissimilar members for each pair of clusters are calculated and then clusters are merged based on the shortest distance
          4. Median Linkage — Similar to the average linkage, but instead of using the average distance, we utilize the median distance
          5. Ward Linkage — Uses the analysis of variance method to determine the distance between clusters
          6. Centroid Linkage — Calculates the centroid of each cluster by taking the average of all points assigned to the cluster and then calculates the distance to other clusters using this centroid

The formulas for these distance calculations are illustrated in Figure 1 below.

          Figure 1. Distance formulas for Linkages mentioned above. Image Credit — Developed by the Author

          Distance Calculation

          Distance between two or more clusters can be calculated using multiple approaches, the most popular being Euclidean Distance. However, other distance metrics like Minkowski, City Block, Hamming, Jaccard, Chebyshev, etc. can also be used with hierarchical clustering. Figure 2 below outlines how hierarchical clustering is influenced by different distance metrics.

Figure 2. Impact of distance calculation and linkage on cluster formation. Image credit — GIF via Gfycat.
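As a small sketch of swapping the metric (reusing X from the snippet above), SciPy's linkage also accepts a metric argument; note that the ward, centroid, and median methods require Euclidean distances:

from scipy.cluster.hierarchy import linkage

# Average linkage with Manhattan (city block) distance instead of Euclidean
Z_cityblock = linkage(X, method='average', metric='cityblock')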

          Dendrogram

A dendrogram represents the relationships between objects in a feature space by displaying the distance at which each pair of sequentially merged objects or clusters was combined. Dendrograms are commonly used to study hierarchical clusters before deciding the number of clusters appropriate for the dataset. The distance at which two clusters combine is referred to as the dendrogram distance; it indicates whether two or more clusters are disjoint or can be combined into one cluster.
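A minimal sketch of drawing such a dendrogram with SciPy (again on synthetic data) might look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(7)
X = rng.random((20, 2))
Z = linkage(X, method='average')

plt.figure(figsize=(8, 4))
dendrogram(Z)  # leaves are the original points; heights are merge distances
plt.ylabel('Dendrogram distance')
plt.show()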

Figure 3. Dendrogram of a hierarchical clustering using Median as the Linkage Type. Image Credits — Developed by the Author using Jupyter Notebook
          Figure 4. Dendrogram of a hierarchical clustering using Average as the Linkage Type. Image Credits — Developed by the Author using Jupyter Notebook
          Figure 5. Dendrogram of a hierarchical clustering using Complete as the Linkage Type. Image Credits — Developed by the Author using Jupyter Notebook

          Cophenetic Coefficient

Figures 3, 4, and 5 above show how the choice of linkage impacts cluster formation. Visually inspecting every dendrogram to determine which linkage works best is challenging and requires a lot of manual effort. To overcome this, we introduce the concept of the Cophenetic Coefficient.

Imagine two clusters, A and B, with points A₁, A₂, and A₃ in cluster A and points B₁, B₂, and B₃ in cluster B. For these two clusters to be well separated, the points A₁, A₂, and A₃ and the points B₁, B₂, and B₃ should be far from each other as well. The cophenetic index measures the correlation between the distances of points in feature space and their distances on the dendrogram. It takes all possible pairs of points in the data and calculates the Euclidean distance between them (which remains the same irrespective of the linkage algorithm we choose). It then computes the dendrogram distance at which clusters A and B combine. If the distance between the points grows in step with the dendrogram distance between the clusters, the cophenetic index is closer to 1.
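A minimal sketch of computing this with SciPy's cophenet, comparing a few linkage methods on synthetic data:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((30, 2))

for method in ['single', 'average', 'complete', 'ward']:
    Z = linkage(X, method=method)
    # Correlation between pairwise feature-space distances and dendrogram distances
    c, _ = cophenet(Z, pdist(X))
    print(f'{method}: {c:.3f}')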

          Figure 6. Cophenet index of different Linkage Methods in hierarchical clustering. Image Credits — Developed by the Author using Jupyter Notebook

          Deciding the Number of Clusters

There are no statistical techniques to decide the number of clusters in hierarchical clustering, unlike the K-Means algorithm, which uses an elbow plot to determine the number of clusters. However, one common approach is to analyze the dendrogram and look for groups that combine at a higher dendrogram distance. Let's take a look at the example below.

          Figure 7. Dendrogram of hierarchical clustering using the average linkage method. Image Credits — Developed by the Author using Jupyter Notebook

Figure 7 illustrates the presence of 5 clusters when the tree is cut at a dendrogram distance of 3. The general idea is that all 5 groups of clusters combine at a much higher dendrogram distance and hence can be treated as individual groups for this analysis. We can also verify the same using a silhouette score; a sketch of cutting the tree programmatically follows.
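A hedged sketch of performing such a cut programmatically, assuming Z is a linkage matrix like the ones built in the snippets above:

from scipy.cluster.hierarchy import fcluster

# Cut the tree at dendrogram distance 3, mirroring the cut discussed for Figure 7
labels = fcluster(Z, t=3, criterion='distance')
print(len(set(labels)), 'clusters')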

          Conclusion

Deciding the number of clusters in any clustering exercise is a tedious task. Since the commercial side of the business is more focused on deriving meaning from these groups, it is important to visualize the clusters in a two-dimensional space and check whether they are distinct from each other. This can be achieved via PCA or Factor Analysis, and it is a widely used way to present the final results to stakeholders, making the output easier for everyone to consume.

          Figure 8. Cluster visual of a hierarchical clustering using two different linkage techniques. Image Credits — Developed by the Author using Jupyter Notebook
          ]]>
          2413
          Silhouette Coefficient Clustering https://shishirkant.com/silhouette-coefficient-clustering/?utm_source=rss&utm_medium=rss&utm_campaign=silhouette-coefficient-clustering Sun, 29 Aug 2021 08:42:27 +0000 http://shishirkant.com/?p=2410 We usually start with K-Means clustering. After going through several tutorials and Medium stories you will be able to implement k-means clustering easily. But as you implement it, a question starts to bug your mind: how can we measure its goodness of fit? Supervised algorithms have lots of metrics to check their goodness of fit like accuracy, r-square value, sensitivity, specificity etc. but what can we calculate to measure the accuracy or goodness of our clustering technique? The answer to this question is Silhouette Coefficient or Silhouette score.

          Silhouette Coefficient:

          Silhouette Coefficient or silhouette score is a metric used to calculate the goodness of a clustering technique. Its value ranges from -1 to 1.

          1: Means clusters are well apart from each other and clearly distinguished.

          0: Means clusters are indifferent, or we can say that the distance between clusters is not significant.

          -1: Means clusters are assigned in the wrong way.

An illustrative figure showing how the Silhouette score is calculated. Image by author.

          Silhouette Score = (b-a)/max(a,b)

          where

a = average intra-cluster distance, i.e. the average distance between each point and the other points in its own cluster.

b = average inter-cluster distance, i.e. the average distance between each point and all points in the nearest neighboring cluster.

          Calculating Silhouette Score

          Importing libraries:

          import pandas as pd
          import numpy as np
          import seaborn as sns
          from sklearn.cluster import KMeans
          from sklearn.metrics import silhouette_score
          %matplotlib inline

          Generating some random data:

To run the clustering algorithm, we generate 100 random points (two well-separated groups of 50).

          X= np.random.rand(50,2)
          Y= 2 + np.random.rand(50,2)
          Z= np.concatenate((X,Y))
          Z=pd.DataFrame(Z) #converting into data frame for ease

          Plotting the data:

sns.scatterplot(x=Z[0], y=Z[1])  # newer seaborn versions require keyword arguments

          Output

          Scatter plot of randomly created data
          Image by author

          Applying KMeans Clustering with 2 clusters:

          KMean= KMeans(n_clusters=2)
          KMean.fit(Z)
          label=KMean.predict(Z)

          Calculating the silhouette score:

          print(f'Silhouette Score(n=2): {silhouette_score(Z, label)}')

          Output: Silhouette Score(n=2): 0.8062146115881652

          We can say that the clusters are well apart from each other as the silhouette score is closer to 1.

To check whether our silhouette score is providing the right information, let's create another scatter plot showing the labelled data points.

sns.scatterplot(x=Z[0], y=Z[1], hue=label)

          Output:

Scatter plot after assigning cluster labels to each data point
          Image by author

          It can be seen clearly in the above figure that each cluster is well apart from each other.

          Let’s try with 3 clusters:

KMean = KMeans(n_clusters=3)
KMean.fit(Z)
label = KMean.predict(Z)
print(f'Silhouette Score(n=3): {silhouette_score(Z, label)}')
sns.scatterplot(x=Z[0], y=Z[1], hue=label, palette='inferno_r')

          Output:

          Silhouette Score(n=3): 0.5969732708311737

          Image by author

As you can see in the above figure, the clusters are not well apart. The inter-cluster distance between cluster 1 and cluster 2 is almost negligible. That is why the silhouette score for n=3 (0.596) is lower than that for n=2 (0.806).

When dealing with higher dimensions, the silhouette score is quite useful for validating the clustering algorithm, since we cannot use any type of visualization to validate clustering when the number of dimensions is greater than 3.

We can also use the silhouette score to find the optimal number of clusters, as sketched below. In the above example, the optimal number of clusters is 2, since its silhouette score is greater than that for 3 clusters.
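A minimal sketch of that search, reusing Z, KMeans, and silhouette_score from above:

for k in range(2, 6):
    km = KMeans(n_clusters=k).fit(Z)
    print(f'Silhouette Score(n={k}): {silhouette_score(Z, km.labels_)}')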

          ]]>
          2410
          Elbow Method in Clustering https://shishirkant.com/elbow-method-in-clustering/?utm_source=rss&utm_medium=rss&utm_campaign=elbow-method-in-clustering Sun, 29 Aug 2021 08:37:28 +0000 http://shishirkant.com/?p=2407 A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
We now demonstrate this method using the K-Means clustering technique with the Sklearn library of Python.


          Step 1: Importing the required libraries

          from sklearn.cluster import KMeans
          from sklearn import metrics
          from scipy.spatial.distance import cdist
          import numpy as np
          import matplotlib.pyplot as plt

          Step 2: Creating and Visualizing the data

          # Creating the data
          x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
          x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
          X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
           
          # Visualizing the data
          plt.plot()
          plt.xlim([0, 10])
          plt.ylim([0, 10])
          plt.title('Dataset')
          plt.scatter(x1, x2)
          plt.show()

          From the above visualization, we can see that the optimal number of clusters should be around 3. But visualizing the data alone cannot always give the right answer. Hence we demonstrate the following steps.
          We now define the following:-

          1. Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
          2. Inertia: It is the sum of squared distances of samples to their closest cluster center.

We iterate over the values of k from 1 to 9 and calculate the distortion and inertia for each value of k in the given range.


          Step 3: Building the clustering model and calculating the values of the Distortion and Inertia:

          distortions = []
          inertias = []
          mapping1 = {}
          mapping2 = {}
          K = range(1, 10)
           
          for k in K:
              # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)
           
              distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                                  'euclidean'), axis=1)) / X.shape[0])
              inertias.append(kmeanModel.inertia_)
           
              mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                             'euclidean'), axis=1)) / X.shape[0]
              mapping2[k] = kmeanModel.inertia_

          Step 4: Tabulating and Visualizing the results


          a) Using the different values of Distortion:

          for key, val in mapping1.items():
              print(f'{key} : {val}')
          plt.plot(K, distortions, 'bx-')
          plt.xlabel('Values of K')
          plt.ylabel('Distortion')
          plt.title('The Elbow Method using Distortion')
          plt.show()

          b) Using the different values of Inertia:

          for key, val in mapping2.items():
              print(f'{key} : {val}')

          plt.plot(K, inertias, 'bx-')
          plt.xlabel('Values of K')
          plt.ylabel('Inertia')
          plt.title('The Elbow Method using Inertia')
          plt.show()

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e., the point after which the distortion/inertia starts decreasing in a linear fashion. Thus for the given data, we conclude that the optimal number of clusters is 3.
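If you prefer to locate the elbow programmatically rather than by eye, one option (an assumption, not used above) is the third-party kneed package:

# Assumes `pip install kneed`; K and inertias come from the loop above
from kneed import KneeLocator

kl = KneeLocator(list(K), inertias, curve='convex', direction='decreasing')
print('Elbow at k =', kl.elbow)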
The clustered data points for different values of k (k = 1, 2, 3, and 4) are shown in the corresponding scatter plots (images omitted).
          ]]>
          2407
          Hierarchical Clustering in Python https://shishirkant.com/hierarchical-clustering-in-python/?utm_source=rss&utm_medium=rss&utm_campaign=hierarchical-clustering-in-python Sun, 29 Aug 2021 08:25:32 +0000 http://shishirkant.com/?p=2403 Introduction to Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that is used to group together unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories −

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster, and pairs of clusters are then successively merged or agglomerated (a bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster, and the process of clustering involves dividing the one big cluster into various smaller clusters (a top-down approach).

          Steps to Perform Agglomerative Hierarchical Clustering

We are going to explain the most used and important hierarchical clustering variant, i.e. agglomerative. The steps to perform it are as follows −

• Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.
• Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points. This will result in a total of K-1 clusters.
• Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.
• Step 4 − Repeat the above three steps until only one big cluster remains, i.e. there are no more clusters left to join.
• Step 5 − At last, after making one single big cluster, dendrograms will be used to divide it into multiple clusters depending upon the problem.

          Role of Dendrograms in Agglomerative Hierarchical Clustering

As we discussed in the last step, the role of the dendrogram starts once the big cluster is formed. The dendrogram is used to split the clusters into multiple clusters of related data points depending upon our problem. It can be understood with the help of the following example −

          Example 1

          To understand, let us start with importing the required libraries as follows −

          %matplotlib inline
          import matplotlib.pyplot as plt
          import numpy as np
          

          Next, we will be plotting the datapoints we have taken for this example −

          X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85], [62,80],[73,60],[87,96],])
          labels = range(1, 11)
          plt.figure(figsize=(10, 7))
          plt.subplots_adjust(bottom=0.1)
          plt.scatter(X[:,0],X[:,1], label='True Position')
          for label, x, y in zip(labels, X[:, 0], X[:, 1]):
             plt.annotate(label,xy=(x, y), xytext=(-3, 3),textcoords='offset points', ha='right', va='bottom')
          plt.show()
(Scatter plot of the ten labelled data points)

From the above diagram, it is very easy to see that we have two clusters in our data points, but in real-world data there can be thousands of clusters. Next, we will plot the dendrogram of our data points by using the Scipy library −

          from scipy.cluster.hierarchy import dendrogram, linkage
          from matplotlib import pyplot as plt
          linked = linkage(X, 'single')
          labelList = range(1, 11)
          plt.figure(figsize=(10, 7))
          dendrogram(linked, orientation='top',labels=labelList, distance_sort='descending',show_leaf_counts=True)
          plt.show()
(Dendrogram of the ten data points, single linkage)

Now, once the big cluster is formed, the longest vertical distance not crossed by any horizontal line is selected, and a horizontal line is drawn through it, as shown in the following diagram. As this horizontal line crosses the blue line at two points, the number of clusters would be two.

(Dendrogram with the horizontal cut line drawn)

          Next, we need to import the class for clustering and call its fit_predict method to predict the cluster. We are importing AgglomerativeClustering class of sklearn.cluster library −

from sklearn.cluster import AgglomerativeClustering
# Note: newer scikit-learn versions rename the 'affinity' argument to 'metric'
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)
          

          Next, plot the cluster with the help of following code −

          plt.scatter(X[:,0],X[:,1], c=cluster.labels_, cmap='rainbow')
          
(Scatter plot coloured by the two predicted clusters)

          The above diagram shows the two clusters from our datapoints.

Example 2

As we have understood the concept of dendrograms from the simple example discussed above, let us move to another example in which we create clusters of the data points in the Pima Indian Diabetes Dataset by using hierarchical clustering −

          import matplotlib.pyplot as plt
          import pandas as pd
          %matplotlib inline
          import numpy as np
          from pandas import read_csv
          path = r"C:\pima-indians-diabetes.csv"
          headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
          data = read_csv(path, names=headernames)
          array = data.values
          X = array[:,0:8]
          Y = array[:,8]
data.shape
# Output: (768, 9)
data.head()

slno.  preg  plas  pres  skin  test  mass  pedi   age  class
0      6     148   72    35    0     33.6  0.627  50   1
1      1     85    66    29    0     26.6  0.351  31   0
2      8     183   64    0     0     23.3  0.672  32   1
3      1     89    66    23    94    28.1  0.167  21   0
4      0     137   40    35    168   43.1  2.288  33   1
          patient_data = data.iloc[:, 3:5].values
          import scipy.cluster.hierarchy as shc
          plt.figure(figsize=(10, 7))
          plt.title("Patient Dendograms")
          dend = shc.dendrogram(shc.linkage(data, method='ward'))
          
(Dendrogram of the patient data)
          from sklearn.cluster import AgglomerativeClustering
          cluster = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
          cluster.fit_predict(patient_data)
          plt.figure(figsize=(10, 7))
          plt.scatter(patient_data[:,0], patient_data[:,1], c=cluster.labels_, cmap='rainbow')
          
(Scatter plot of the four patient clusters)
          ]]>
          2403
          Mean-Shift Clustering in Python https://shishirkant.com/mean-shift-clustering-in-python/?utm_source=rss&utm_medium=rss&utm_campaign=mean-shift-clustering-in-python Sun, 29 Aug 2021 08:23:23 +0000 http://shishirkant.com/?p=2399 Introduction to Mean-Shift Algorithm

          As discussed earlier, it is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not make any assumptions; hence it is a non-parametric algorithm.

The Mean-Shift algorithm assigns data points to clusters iteratively by shifting points towards the region of highest density of data points, i.e. the cluster centroid.

The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the number of clusters to be specified in advance, because the number of clusters is determined by the algorithm from the data.

          Working of Mean-Shift Algorithm

          We can understand the working of Mean-Shift clustering algorithm with the help of following steps −

          • Step 1 − First, start with the data points assigned to a cluster of their own.
          • Step 2 − Next, this algorithm will compute the centroids.
          • Step 3 − In this step, location of new centroids will be updated.
          • Step 4 − Now, the process will be iterated and moved to the higher density region.
• Step 5 − At last, it will be stopped once the centroids reach a position from which they cannot move further.

          Implementation in Python

It is a simple example to understand how the Mean-Shift algorithm works. In this example, we are going to first generate a dataset containing 3 different blobs (note that the centers below are three-dimensional) and after that apply the Mean-Shift algorithm to see the result.

          %matplotlib inline
          import numpy as np
          from sklearn.cluster import MeanShift
          import matplotlib.pyplot as plt
          from matplotlib import style
          style.use("ggplot")
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn
          centers = [[3,3,3],[4,5,5],[3,10,10]]
          X, _ = make_blobs(n_samples = 700, centers = centers, cluster_std = 0.5)
          plt.scatter(X[:,0],X[:,1])
          plt.show()
          
(Scatter plot of the generated blobs)
          ms = MeanShift()
          ms.fit(X)
          labels = ms.labels_
          cluster_centers = ms.cluster_centers_
          print(cluster_centers)
          n_clusters_ = len(np.unique(labels))
          print("Estimated clusters:", n_clusters_)
          colors = 10*['r.','g.','b.','c.','k.','y.','m.']
          for i in range(len(X)):
              plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 3)
          plt.scatter(cluster_centers[:,0],cluster_centers[:,1],
              marker=".",color='k', s=20, linewidths = 5, zorder=10)
          plt.show()

          Output

          [[ 2.98462798 9.9733794 10.02629344]
          [ 3.94758484 4.99122771 4.99349433]
          [ 3.00788996 3.03851268 2.99183033]]
          Estimated clusters: 3
          
(Scatter plot with the estimated cluster centers marked)

          Advantages and Disadvantages

          Advantages

          The following are some advantages of Mean-Shift clustering algorithm −

• It does not need to make any model assumption, unlike K-means or Gaussian mixture models.
• It can also model complex clusters which have a nonconvex shape.
• It only needs one parameter, named bandwidth, which automatically determines the number of clusters (see the sketch after this list).
• There is no issue of local minima, as there is in K-means.
• It is robust to outliers.
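As a small sketch of that bandwidth parameter, scikit-learn can estimate it from the data with estimate_bandwidth (reusing X from the example above):

from sklearn.cluster import MeanShift, estimate_bandwidth

# quantile controls how local the density estimate is (smaller -> more clusters)
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth).fit(X)
print('Estimated clusters:', len(ms.cluster_centers_))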

          Disadvantages

          The following are some disadvantages of Mean-Shift clustering algorithm −

The Mean-Shift algorithm does not work well in high dimensions, where the number of clusters can change abruptly.

• We do not have any direct control over the number of clusters, but in some applications we need a specific number of clusters.
• It cannot differentiate between meaningful and meaningless modes.
          ]]>
          2399
          K-Means Clustering in Python https://shishirkant.com/k-means-clustering-in-python/?utm_source=rss&utm_medium=rss&utm_campaign=k-means-clustering-in-python Sun, 29 Aug 2021 08:20:43 +0000 http://shishirkant.com/?p=2396 Introduction to K-Means Algorithm

The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by ‘K’ in K-means.

In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroids is as small as possible. It is to be understood that less variation within the clusters leads to more similar data points within the same cluster.

          Working of K-Means Algorithm

          We can understand the working of K-Means clustering algorithm with the help of following steps −

• Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.
• Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple words, classify the data based on the number of data points.
• Step 3 − Now it will compute the cluster centroids.
• Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of data points to the clusters no longer changes −

4.1 − First, the sum of squared distances between the data points and centroids is computed.

4.2 − Now, assign each data point to the cluster whose centroid is closest to it.

4.3 − At last, compute the centroids for the clusters by taking the average of all data points of that cluster.

K-means follows the Expectation-Maximization approach to solve the problem. The Expectation step is used for assigning the data points to the closest cluster, and the Maximization step is used for computing the centroid of each cluster. A minimal sketch of this loop follows.
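The following is a minimal NumPy sketch of this Expectation-Maximization loop, for illustration only (the scikit-learn KMeans used below is the practical choice):

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # Expectation step: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return labels, centroids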

          While working with K-means algorithm we need to take care of the following things −

• While working with clustering algorithms, including K-Means, it is recommended to standardize the data, because such algorithms use distance-based measures to determine the similarity between data points.
• Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to use different initializations of centroids, as sketched below.
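A short sketch of both precautions with scikit-learn (X here stands for any feature matrix):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
# n_init runs k-means with several centroid initializations and keeps the best
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)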

          Implementation in Python

          The following two examples of implementing K-Means clustering algorithm will help us in its better understanding −

          Example 1

It is a simple example to understand how k-means works. In this example, we are going to first generate a 2D dataset containing 4 different blobs and after that apply the k-means algorithm to see the result.

          First, we will start by importing the necessary packages −

          %matplotlib inline
          import matplotlib.pyplot as plt
          import seaborn as sns; sns.set()
          import numpy as np
          from sklearn.cluster import KMeans
          

          The following code will generate the 2D, containing four blobs −

from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn
          X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
          

          Next, the following code will help us to visualize the dataset −

          plt.scatter(X[:, 0], X[:, 1], s=20);
          plt.show()
          
(Scatter plot of the four blobs)

Next, make an object of KMeans, provide the number of clusters, train the model, and do the prediction as follows −

          kmeans = KMeans(n_clusters=4)
          kmeans.fit(X)
          y_kmeans = kmeans.predict(X)
          

Now, with the help of the following code, we can plot and visualize the cluster centers picked by the k-means estimator −

          plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
          centers = kmeans.cluster_centers_
          plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9);
          plt.show()
(Scatter plot with the k-means cluster centers highlighted)

          Example 2

          Let us move to another example in which we are going to apply K-means clustering on simple digits dataset. K-means will try to identify similar digits without using the original label information.

          First, we will start by importing the necessary packages −

          %matplotlib inline
          import matplotlib.pyplot as plt
          import seaborn as sns; sns.set()
          import numpy as np
          from sklearn.cluster import KMeans
          

Next, load the digits dataset from sklearn and make an object of it. We can also find the number of rows and columns in this dataset as follows −

          from sklearn.datasets import load_digits
          digits = load_digits()
          digits.data.shape
          

          Output

          (1797, 64)
          

The above output shows that this dataset has 1797 samples with 64 features.

          We can perform the clustering as we did in Example 1 above −

          kmeans = KMeans(n_clusters=10, random_state=0)
          clusters = kmeans.fit_predict(digits.data)
          kmeans.cluster_centers_.shape
          

          Output

          (10, 64)
          

The above output shows that K-means created 10 cluster centers, each with 64 features.

          fig, ax = plt.subplots(2, 5, figsize=(8, 3))
          centers = kmeans.cluster_centers_.reshape(10, 8, 8)
          for axi, center in zip(ax.flat, centers):
             axi.set(xticks=[], yticks=[])
             axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
          

          Output

          As output, we will get following image showing clusters centers learned by k-means.

(Image of the 10 learned cluster centers rendered as 8×8 digit prototypes)

The following lines of code will match the learned cluster labels with the true labels −

          from scipy.stats import mode
          labels = np.zeros_like(clusters)
          for i in range(10):
             mask = (clusters == i)
             labels[mask] = mode(digits.target[mask])[0]
          

          Next, we can check the accuracy as follows −

          from sklearn.metrics import accuracy_score
          accuracy_score(digits.target, labels)
          

          Output

          0.7935447968836951
          

          The above output shows that the accuracy is around 80%.

          Advantages and Disadvantages

          Advantages

          The following are some advantages of K-Means clustering algorithms −

• It is very easy to understand and implement.
• If we have a large number of variables, then K-means is faster than hierarchical clustering.
• On re-computation of centroids, an instance can change its cluster.
• Tighter clusters are formed with K-means as compared to hierarchical clustering.

          Disadvantages

          The following are some disadvantages of K-Means clustering algorithms −

• It is a bit difficult to predict the number of clusters, i.e. the value of k.
• The output is strongly impacted by the initial inputs, like the number of clusters (value of k).
• The order of the data will have a strong impact on the final output.
• It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, then the output will completely change.
• It is not good at clustering when the clusters have a complicated geometric shape.

          Applications of K-Means Clustering Algorithm

          The main goals of cluster analysis are −

          • To get a meaningful intuition from the data we are working with.
          • Cluster-then-predict where different models will be built for different subgroups.

To fulfill the above-mentioned goals, K-means clustering performs well enough. It can be used in the following applications −

          • Market segmentation
          • Document Clustering
          • Image segmentation
          • Image compression
          • Customer segmentation
          • Analyzing the trend on dynamic data
          ]]>
          2396