Differences between Jaccard Similarity and Cosine Similarity

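Jaccard Similarity and Cosine Similarity are two common methods used to measure the similarity between two sets or two vectors. Jaccard Similarity compares which elements two sets share, while Cosine Similarity compares the direction of two vectors, so they suit different scenarios and use different formulas.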

1. Jaccard Similarity: Jaccard Similarity is primarily used for sets or binary data. It compares the presence or absence of elements between two sets. The formula for Jaccard Similarity is:

J(A, B) = |A ∩ B| / |A ∪ B|

Where:

  • |A ∩ B| is the size of the intersection of sets A and B.
  • |A ∪ B| is the size of the union of sets A and B.

Jaccard Similarity ranges from 0 (no common elements) to 1 (identical sets).
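
To make the formula concrete, here is a minimal PHP sketch that treats two word lists as sets. The function name jaccardSimilarity and the sample word lists are assumptions made for illustration and are not part of the script shown later; returning 0 for two empty sets is also just a convention chosen here.

```php
<?php
// Hypothetical helper (illustration only): Jaccard similarity of two word lists.
function jaccardSimilarity(array $a, array $b): float
{
    // Treat each input as a set by discarding duplicate words
    $setA = array_unique($a);
    $setB = array_unique($b);

    // |A ∩ B|: words of A that also occur in B
    $intersection = count(array_intersect($setA, $setB));

    // |A ∪ B| = |A| + |B| - |A ∩ B|
    $union = count($setA) + count($setB) - $intersection;

    // The ratio is undefined for two empty sets; 0.0 is an assumed convention
    return $union > 0 ? $intersection / $union : 0.0;
}

$a = ['the', 'quick', 'brown', 'fox'];
$b = ['the', 'lazy', 'brown', 'dog'];

// 2 shared words ("the", "brown") out of 6 distinct words, i.e. 2/6
echo jaccardSimilarity($a, $b), PHP_EOL;
```

Because duplicates are discarded first, repeating a word any number of times does not change the score.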

2. Cosine Similarity: Cosine Similarity is commonly used for vector representations, particularly in text analysis or document comparisons. It calculates the cosine of the angle between two vectors. The formula for Cosine Similarity is:

cos(θ) = (A · B) / (||A|| ||B||)

Where:

  • A · B is the dot product of vectors A and B.
  • ||A|| and ||B|| are the magnitudes (or lengths) of vectors A and B.

Cosine Similarity ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonality. For vectors with no negative components, such as word counts, it stays between 0 and 1.
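
As a rough illustration of the formula, the following PHP sketch computes the cosine similarity of two word-count vectors, each stored as an array that maps a word to its count. The function name cosineSimilarity and the sample data are assumptions made for this example; returning 0 for an empty vector is likewise an assumed convention.

```php
<?php
// Hypothetical helper (illustration only): cosine similarity of two
// word-count vectors, each given as an array mapping word => count.
function cosineSimilarity(array $countsA, array $countsB): float
{
    $dot = 0;
    $normA = 0;
    $normB = 0;

    foreach ($countsA as $word => $count) {
        // Words missing from B contribute 0 to the dot product
        $dot += $count * ($countsB[$word] ?? 0);
        $normA += $count ** 2;
    }
    foreach ($countsB as $count) {
        $normB += $count ** 2;
    }

    // A zero vector has no direction; 0.0 is an assumed convention
    return ($normA > 0 && $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;
}

$a = array_count_values(['spam', 'spam', 'spam', 'eggs']); // spam: 3, eggs: 1
$b = array_count_values(['spam', 'eggs', 'eggs', 'eggs']); // spam: 1, eggs: 3

// Dot product 3*1 + 1*3 = 6, both magnitudes sqrt(10), so 6 / 10
echo cosineSimilarity($a, $b), PHP_EOL;
```

Both lists contain the same two distinct words, so their Jaccard Similarity is 1, while the Cosine Similarity is 0.6 because the word counts differ. This is the practical difference between the two measures: Jaccard only sees presence or absence, whereas cosine is sensitive to how often each word occurs.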

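Uses:-

The PHP script below applies both measures in a simple plagiarism checker. It sends the submitted text to the Google Custom Search JSON API, scores each result snippet against that text, and reports the highest cosine score as the plagiarism percentage.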

   <?php

        if ($_SERVER["REQUEST_METHOD"] == "POST") {
            $user_content = isset($_POST["user_content"]) ? $_POST["user_content"] : '';

            if (!empty($user_content)) {

                // echo "Searching for: $user_content<br>";
                // Your Google API key
                $apiKey = '';

                // Your Custom Search Engine ID
                $cx = '';

                // Function to fetch Google search results using Custom Search JSON API with pagination
                function fetchGoogleResults($query, $apiKey, $cx, $startIndex = 1)
                {
                    // Define the number of results per page
                    $resultsPerPage = 10;

                    // Calculate the start index for the current page
                    $start = ($startIndex - 1) * $resultsPerPage + 1;

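                    // Build the API request URL; http_build_query URL-encodes every parameter
                    $url = "https://www.googleapis.com/customsearch/v1?" . http_build_query([
                        'key' => $apiKey,
                        'cx' => $cx,
                        'q' => $query,
                        'start' => $start,
                    ]);

                    // Fetch the results using file_get_contents (returns false on failure)
                    $response = file_get_contents($url);

                    // Fall back to an empty JSON object so the caller can still decode the result
                    return $response === false ? '{}' : $response;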
                }

                // Fetch and extract Google search results from multiple pages
                $matches = [];

                for ($page = 1; $page <= 5; $page++) { // You can adjust the number of pages as needed
                    $google_results_page = fetchGoogleResults($user_content, $apiKey, $cx, $page);
                   
                    $google_results_page_data = json_decode($google_results_page, true);
                   
                    // Check if there are items in the current page
                    if (isset($google_results_page_data['items'])) {
                        foreach ($google_results_page_data['items'] as $item) {
                            
                            // Tokenize the strings, lowercased so matching is case-insensitive
                            // (mb_strtolower needs the mbstring extension)
                            $user_tokens = array_count_values(preg_split('/\W+/u', mb_strtolower($user_content), -1, PREG_SPLIT_NO_EMPTY));
                            $google_tokens = array_count_values(preg_split('/\W+/u', mb_strtolower($item['snippet'] ?? ''), -1, PREG_SPLIT_NO_EMPTY));

                            // Count the distinct words shared by both texts and the distinct words in either text
                            $shared_words = count(array_intersect_key($user_tokens, $google_tokens));
                            $all_words = count($user_tokens) + count($google_tokens) - $shared_words;

                            // Calculate Jaccard similarity |A ∩ B| / |A ∪ B| (check for division by zero)
                            $jaccard_percentage = ($all_words > 0) ? ($shared_words / $all_words) * 100 : 0;
                    
                            // Calculate cosine similarity
                            $dotProduct = 0;
                            $norm1 = 0;
                            $norm2 = 0;
                            foreach ($user_tokens as $word => $count) {
                                $dotProduct += $count * ($google_tokens[$word] ?? 0);
                                $norm1 += $count ** 2;
                            }
                            foreach ($google_tokens as $count) {
                                $norm2 += $count ** 2;
                            }
                    
                            // Check for division by zero before calculating cosine similarity
                            $cosine = ($norm1 * $norm2 > 0) ? $dotProduct / sqrt($norm1 * $norm2) : 0;
                            $cosine_percentage = $cosine * 100;
                    
                            // Add match to array
                            $matches[] = [
                                'title' => $item['title'],
                                'link' => $item['link'],
                                'snippet' => $item['snippet'],
                                'jaccard' => $jaccard_percentage,
                                'cosine' => $cosine_percentage
                            ];
                        }
                    }
                    
                }

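                // Calculate overall plagiarism and uniqueness based on the most similar match
                // (max() throws on an empty array in PHP 8, so fall back to 0 when nothing matched)
                $max_cosine = !empty($matches) ? max(array_column($matches, 'cosine')) : 0;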

                $plagiarism_percentage = $max_cosine;
                $uniqueness_percentage = 100 - $plagiarism_percentage;

                // Display overall results
                echo "<h2>Overall Plagiarism Check Results</h2>";
                echo "<p>Plagiarism: " . number_format($plagiarism_percentage, 2) . "%</p>";
                echo "<p>Unique: " . number_format($uniqueness_percentage, 2) . "%</p>";

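                // Sort the matches by cosine similarity in descending order
                // (usort casts the callback result to int, so subtracting floats would treat
                // differences below 1 as equal; <=> always returns -1, 0, or 1)
                usort($matches, function ($a, $b) {
                    return $b['cosine'] <=> $a['cosine'];
                });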

                // Display only the top 5 matches
                $top_matches = array_slice($matches, 0, 5);

                // Display the heading for the top matches
                echo "<h2>Top 5 Matches</h2>";

                if (!empty($top_matches)) {
                    // Display detailed matches
                    foreach ($top_matches as $match) {
                        echo "<div>";
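                        // Escape API-supplied text before output so it cannot inject HTML
                        echo "<h3>" . htmlspecialchars($match['title']) . "</h3>";
                        echo "<p>" . htmlspecialchars($match['snippet']) . "</p>";
                        echo "<p>Jaccard Similarity: " . number_format($match['jaccard'], 2) . "%</p>";
                        echo "<p>Cosine Similarity: " . number_format($match['cosine'], 2) . "%</p>";
                        // ENT_QUOTES also escapes single quotes, which delimit the href attribute
                        echo "<a href='" . htmlspecialchars($match['link'], ENT_QUOTES) . "' target='_blank'>Read More</a>";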
                        echo "</div>";
                        echo "<hr>";
                    }
                } else {
                    echo "<p>No matches found.</p>";
                }

                // Clear the textarea content
                $_POST["user_content"] = "";
            } else {
                echo "<p>Please provide content for plagiarism check.</p>";
            }
        }

        ?>

Result:-
