Differences between Jaccard Similarity and Cosine Similarity

Jaccard Similarity and Cosine Similarity are two common measures of similarity: Jaccard compares two sets, while Cosine compares two vectors. They suit different scenarios and use different formulas.

1. Jaccard Similarity: Jaccard Similarity is primarily used for sets or binary data. It compares the presence or absence of elements between two sets. The formula for Jaccard Similarity is:

J(A, B) = |A ∩ B| / |A ∪ B|

Where:

  • |A ∩ B| is the size of the intersection of sets A and B.
  • |A ∪ B| is the size of the union of sets A and B.

Jaccard Similarity ranges from 0 (no common elements) to 1 (identical sets). It is undefined when both sets are empty, because the union then has size 0.
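
As a minimal sketch of the formula above, the snippet below computes Jaccard Similarity for two lists of words. The function name `jaccardSimilarity` and the sample words are illustrative assumptions, and returning 0.0 for two empty sets is just one possible convention.

```php
<?php
// Hypothetical helper (an illustration, not taken from any library):
// Jaccard similarity of two word lists, treated as sets.
function jaccardSimilarity(array $a, array $b): float
{
    // Drop duplicate words so each list behaves like a set
    $setA = array_unique($a);
    $setB = array_unique($b);

    $intersection = count(array_intersect($setA, $setB));
    $union = count(array_unique(array_merge($setA, $setB)));

    // Both sets empty: the ratio is undefined, so this sketch returns 0.0 by convention
    return $union > 0 ? $intersection / $union : 0.0;
}

$a = ['the', 'quick', 'brown', 'fox'];
$b = ['the', 'lazy', 'brown', 'dog'];

echo jaccardSimilarity($a, $b), "\n"; // 2 shared words out of 6 distinct words
```

Repeating a word in either list does not change the score, because only presence or absence counts.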

2. Cosine Similarity: Cosine Similarity is commonly used for vector representations, particularly in text analysis or document comparisons. It calculates the cosine of the angle between two vectors. The formula for Cosine Similarity is:

cos(θ) = (A · B) / (||A|| ||B||)

Where:

  • A · B is the dot product of vectors A and B.
  • ||A|| and ||B|| are the magnitudes (or lengths) of vectors A and B.

Cosine Similarity ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonality. For vectors with no negative components, such as word counts, it stays between 0 and 1.
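
The formula above can be sketched for sparse vectors stored as key => value arrays, where a missing key counts as 0. The function name `cosineSimilarity` and the sample counts are illustrative assumptions, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```php
<?php
// Hypothetical helper (an illustration, not taken from any library):
// cosine similarity of two sparse vectors stored as key => value maps.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $key => $value) {
        $dot += $value * ($b[$key] ?? 0); // keys missing from $b contribute 0
        $normA += $value ** 2;
    }
    foreach ($b as $value) {
        $normB += $value ** 2;
    }

    // A zero vector has no direction; this sketch returns 0.0 in that case
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Word counts for "apple apple banana" and "apple banana banana"
$a = ['apple' => 2, 'banana' => 1];
$b = ['apple' => 1, 'banana' => 2];

echo cosineSimilarity($a, $b), "\n"; // (2*1 + 1*2) / (sqrt(5) * sqrt(5)) = 4/5
```

Scaling either vector by a positive constant leaves the score unchanged, since only the direction matters.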

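Uses:-

The PHP script below applies both measures in a simple plagiarism checker. It sends the submitted text to the Google Custom Search JSON API, scores each result snippet against that text, and reports the closest matches.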

   <?php

        if ($_SERVER["REQUEST_METHOD"] == "POST") {
            $user_content = isset($_POST["user_content"]) ? $_POST["user_content"] : '';

            if (!empty($user_content)) {

                // echo "Searching for: $user_content<br>";
                // Your Google API key
                $apiKey = '';

                // Your Custom Search Engine ID
                $cx = '';

                // Function to fetch Google search results using Custom Search JSON API with pagination
                function fetchGoogleResults($query, $apiKey, $cx, $startIndex = 1)
                {
                    // Define the number of results per page
                    $resultsPerPage = 10;

                    // Calculate the start index for the current page
                    $start = ($startIndex - 1) * $resultsPerPage + 1;

                    // Build the API request URL with the start parameter
                    $url = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx=$cx&q=" . urlencode($query) . "&start=$start";

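                    // Fetch the results; file_get_contents returns false on failure
                    $response = @file_get_contents($url);

                    // Return an empty JSON object on failure so the caller finds no 'items'
                    return ($response === false) ? '{}' : $response;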
                }

                // Fetch and extract Google search results from multiple pages
                $matches = [];

                for ($page = 1; $page <= 5; $page++) { // You can adjust the number of pages as needed
                    $google_results_page = fetchGoogleResults($user_content, $apiKey, $cx, $page);
                   
                    $google_results_page_data = json_decode($google_results_page, true);
                   
                    // Check if there are items in the current page
                    if (isset($google_results_page_data['items'])) {
                        foreach ($google_results_page_data['items'] as $item) {
                            
                            // Tokenize the strings into lowercase word => count maps (mb_strtolower needs the mbstring extension)
                            $user_tokens = array_count_values(preg_split('/\W+/u', mb_strtolower($user_content), -1, PREG_SPLIT_NO_EMPTY));
                            $google_tokens = array_count_values(preg_split('/\W+/u', mb_strtolower($item['snippet']), -1, PREG_SPLIT_NO_EMPTY));
                           
                            // Count the distinct words shared by both texts and the distinct words in either text
                            $matching_words = array_intersect_key($user_tokens, $google_tokens);
                            $intersection_size = count($matching_words);
                            $union_size = count($user_tokens + $google_tokens);
                    
                            // Calculate Jaccard similarity |A ∩ B| / |A ∪ B| (check for division by zero)
                            $jaccard_percentage = ($union_size > 0) ? ($intersection_size / $union_size) * 100 : 0;
                    
                            // Calculate cosine similarity
                            $dotProduct = 0;
                            $norm1 = 0;
                            $norm2 = 0;
                            foreach ($user_tokens as $word => $count) {
                                $dotProduct += $count * ($google_tokens[$word] ?? 0);
                                $norm1 += $count ** 2;
                            }
                            foreach ($google_tokens as $count) {
                                $norm2 += $count ** 2;
                            }
                    
                            // Check for division by zero before calculating cosine similarity
                            $cosine = ($norm1 * $norm2 > 0) ? $dotProduct / sqrt($norm1 * $norm2) : 0;
                            $cosine_percentage = $cosine * 100;
                    
                            // Add match to array
                            $matches[] = [
                                'title' => $item['title'],
                                'link' => $item['link'],
                                'snippet' => $item['snippet'],
                                'jaccard' => $jaccard_percentage,
                                'cosine' => $cosine_percentage
                            ];
                        }
                    }
                    
                }

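                // Calculate overall plagiarism and uniqueness based on the most similar match
                // (max() fails on an empty array, so fall back to 0 when there are no matches)
                $max_cosine = !empty($matches) ? max(array_column($matches, 'cosine')) : 0;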

                $plagiarism_percentage = $max_cosine;
                $uniqueness_percentage = 100 - $plagiarism_percentage;

                // Display overall results
                echo "<h2>Overall Plagiarism Check Results</h2>";
                echo "<p>Plagiarism: " . number_format($plagiarism_percentage, 2) . "%</p>";
                echo "<p>Unique: " . number_format($uniqueness_percentage, 2) . "%</p>";

                // Sort the matches by cosine similarity in descending order
                // (<=> returns an integer; a float difference would be truncated by usort)
                usort($matches, function ($a, $b) {
                    return $b['cosine'] <=> $a['cosine'];
                });

                // Display only the top 5 matches
                $top_matches = array_slice($matches, 0, 5);

                // Display the heading for the top matches
                echo "<h2>Top 5 Matches</h2>";

                if (!empty($top_matches)) {
                    // Display detailed matches
                    foreach ($top_matches as $match) {
                        echo "<div>";
                        // Escape API-supplied text before output so it cannot inject markup
                        echo "<h3>" . htmlspecialchars($match['title'], ENT_QUOTES) . "</h3>";
                        echo "<p>" . htmlspecialchars($match['snippet'], ENT_QUOTES) . "</p>";
                        echo "<p>Jaccard Similarity: " . number_format($match['jaccard'], 2) . "%</p>";
                        echo "<p>Cosine Similarity: " . number_format($match['cosine'], 2) . "%</p>";
                        echo "<a href='" . htmlspecialchars($match['link'], ENT_QUOTES) . "' target='_blank'>Read More</a>";
                        echo "</div>";
                        echo "<hr>";
                    }
                } else {
                    echo "<p>No matches found.</p>";
                }

                // Clear the textarea content
                $_POST["user_content"] = "";
            } else {
                echo "<p>Please provide content for plagiarism check.</p>";
            }
        }

        ?>
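
To see how the two measures can disagree, the scoring step can be pulled out into a standalone function and run without any API call. This is only a sketch: the name `scoreTexts` and the sample strings are assumptions, and the function returns values between 0 and 1 rather than percentages.

```php
<?php
// Hypothetical helper (not part of the script above): scores two texts with
// both measures and returns values in the range 0..1.
function scoreTexts(string $textA, string $textB): array
{
    // Split on non-word characters and count each word
    $countsA = array_count_values(preg_split('/\W+/u', $textA, -1, PREG_SPLIT_NO_EMPTY));
    $countsB = array_count_values(preg_split('/\W+/u', $textB, -1, PREG_SPLIT_NO_EMPTY));

    // Jaccard: distinct shared words over distinct words in either text
    $shared = count(array_intersect_key($countsA, $countsB));
    $either = count($countsA + $countsB);
    $jaccard = $either > 0 ? $shared / $either : 0.0;

    // Cosine: dot product of the count vectors over the product of their lengths
    $dot = 0;
    $normA = 0;
    $normB = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
        $normA += $count ** 2;
    }
    foreach ($countsB as $count) {
        $normB += $count ** 2;
    }
    $cosine = ($normA * $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;

    return ['jaccard' => $jaccard, 'cosine' => $cosine];
}

// Same vocabulary, different word frequencies
$scores = scoreTexts('apple apple apple banana', 'apple banana banana banana');
printf("Jaccard: %.2f, Cosine: %.2f\n", $scores['jaccard'], $scores['cosine']);
```

Both sample texts use the same two words, so Jaccard is 1. The word counts differ (3 and 1 versus 1 and 3), so Cosine is 6 / 10 = 0.6. Jaccard ignores how often a word appears, while Cosine is sensitive to it.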

Result:-
