
{"id":17504,"date":"2024-05-08T17:41:21","date_gmt":"2024-05-08T21:41:21","guid":{"rendered":"https:\/\/ipullrank.com\/?p=17504"},"modified":"2025-07-31T15:51:35","modified_gmt":"2025-07-31T19:51:35","slug":"vector-embeddings-is-all-you-need","status":"publish","type":"post","link":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need","title":{"rendered":"Vector Embeddings is All You Need: SEO Use Cases for Vectorizing the Web with Screaming Frog"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"17504\" class=\"elementor elementor-17504\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-0eb0a34 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"0eb0a34\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2b9bdc2\" data-id=\"2b9bdc2\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-16d6878 elementor-widget elementor-widget-text-editor\" data-id=\"16d6878\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Since learning of the importance and magnitude of vector embeddings, I have been proposing that a link index should vectorize the web and make those representations of pages available to SEOs. Fundamentally, with the further integration of machine learning in Google\u2019s ranking systems, vector embeddings are even more important to what we do than an understanding of the link graph. Currently, only search engines and large language modelers have this data at scale. 
I believe every SEO tool should be providing this data about pages and keywords.<\/p><p class=\"p1\">Google has been leveraging vector embeddings to understand the semantics of the web since the introduction of <a href=\"https:\/\/research.google\/pubs\/distributed-representations-of-words-and-phrases-and-their-compositionality\/\"><span class=\"s1\">Word2Vec<\/span><\/a> in 2013 (<a href=\"https:\/\/medium.com\/@manansuri\/a-dummys-guide-to-word2vec-456444f3c673\"><span class=\"s1\">here\u2019s a simpler explanation<\/span><\/a>) through the update we knew as Hummingbird, but the SEO software industry has continued to operate predominantly on the lexical model of natural language understanding. Lesser-known tools like InLinks, WordLift, MarketMuse, MarketBrew, and the various keyword clustering tools have all done things with this technology, but the popular SEO tools have not surfaced many semantic features.<\/p><p class=\"p1\">To some degree this showcases the gap between modern information retrieval and search engine optimization as industries. The common (simplified) understanding of Google in our space is that its systems simply crawl pages, break the content into lexical components, count the presence, prominence, and distribution of words, review linking relationships, and rank pages by expanding queries and breaking them into n-grams. Once it retrieves posting lists based on the n-grams, it intersects the results, scores that intersected list, sorts by the score, and presents the rankings. Then it reinforces what ranks based on user signals. 
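That classic lexical pipeline can be sketched in a few lines of Python. This is a toy illustration with hypothetical documents and naive term-frequency scoring, not Google's implementation:

```python
from collections import defaultdict

# Toy corpus standing in for crawled pages (hypothetical documents).
docs = {
    1: "seo tools crawl pages and count words",
    2: "vector embeddings represent pages semantically",
    3: "seo tools rank pages for seo by counting words",
}

# Index: break content into lexical components -> posting list per term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def retrieve(query):
    """Intersect posting lists, score the intersected list, sort, present."""
    terms = query.split()
    postings = [index[t] for t in terms if t in index]
    if not postings or len(postings) < len(terms):
        return []
    candidates = set.intersection(*postings)
    # Naive score: total occurrences of query terms in the document.
    scored = {d: sum(docs[d].split().count(t) for t in terms) for d in candidates}
    return sorted(scored, key=scored.get, reverse=True)

print(retrieve("seo tools pages"))  # -> [3, 1]: doc 3 mentions "seo" twice
```

Real systems layer query expansion, BM25-style weighting, and user signals on top, but the crawl-index-intersect-score-sort skeleton is the same.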
Structurally, the model architecture looks like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4a1e60 elementor-widget elementor-widget-image\" data-id=\"b4a1e60\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"800\" height=\"452\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-1024x579.png\" class=\"attachment-large size-large wp-image-17527\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-1024x579.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-300x170.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-768x434.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-825x466.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19-945x534.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/image19.png 1449w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed1ffb7 elementor-widget elementor-widget-text-editor\" data-id=\"ed1ffb7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">That model is not wrong per se, because Google still does these things. 
It\u2019s just not at all indicative of the state of the art with which Google Search operates, because it does so much more.<\/p><p class=\"p1\">Distinguished Google Researcher Marc Najork, in his <a href=\"https:\/\/docs.google.com\/presentation\/d\/19lAeVzPkh20Ly855tKDkz1uv-1pHV_9GxfntiTJPUug\/edit\"><span class=\"s1\">Generative Information Retrieval presentation<\/span><\/a> (where the above image comes from), discussed how the state of the art has evolved to fusion-based approaches that are a hybrid between lexical and semantic models. He cites his Google Research team\u2019s own 2020 paper <a href=\"https:\/\/arxiv.org\/abs\/2010.01195\"><span class=\"s1\">Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach<\/span><\/a>, in which they showcase open-source libraries to implement and examine the viability of the method. They combine <a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\"><span class=\"s1\">BM25<\/span><\/a> (the lexical retrieval model SEO effectively still operates on) with Google\u2019s <a href=\"https:\/\/research.google\/blog\/announcing-scann-efficient-vector-similarity-search\/\"><span class=\"s1\">SCaNN<\/span><\/a> package for vector search over dense <a href=\"https:\/\/huggingface.co\/blog\/bert-101\"><span class=\"s1\">BERT<\/span><\/a> embeddings, then merge and re-rank the results with a method called RM3.\u00a0<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-42a0593 elementor-widget elementor-widget-image\" data-id=\"42a0593\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"800\" height=\"479\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/the-hybrid-retrieval-approach.png\" class=\"attachment-large size-large wp-image-17505\" alt=\"the hybrid retrieval approach
diagram\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/the-hybrid-retrieval-approach.png 906w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/the-hybrid-retrieval-approach-300x180.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/the-hybrid-retrieval-approach-768x460.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/the-hybrid-retrieval-approach-825x494.png 825w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff1c99e elementor-widget elementor-widget-text-editor\" data-id=\"ff1c99e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">One of the key differences in the models is the advent of nearest neighbor searching with dense vectors. This is why rankings often no longer behave the way we anticipate. Semantic matching, and Google\u2019s specific improvements on it <a href=\"https:\/\/blog.google\/products\/search\/how-ai-powers-great-search-results\/#:~:text=But%20it%20wasn%E2%80%99t%20until%202018%2C%20when%20we%20introduced%20neural%20matching%20to%20Search%2C%20that%20we%20could%20use%20them%20to%20better%20understand%20how%20queries%20relate%20to%20pages.\"><span class=\"s1\">\u201cneural matching,\u201d<\/span><\/a> is often a \u201cfuzzier\u201d understanding of relevance when you\u2019re used to seeing the explicit presence of words in page titles, h1 tags, and distributed across body text and more links with targeted anchor text.\u00a0<\/p><p class=\"p1\">In fact, our mental model of how Google works is quite out of date. 
Based on what Najork presents in his deck, it looks a lot more like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9963dc0 elementor-widget elementor-widget-image\" data-id=\"9963dc0\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"800\" height=\"448\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-1024x574.png\" class=\"attachment-large size-large wp-image-17506\" alt=\"architecture of hybrid lexical:semantic retrieval system\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-1024x574.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-300x168.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-768x430.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-825x462.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system-945x529.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/architecture-of-hybrid-lexicalsemantic-retrieval-system.png 1453w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6260c37 elementor-widget elementor-widget-text-editor\" data-id=\"6260c37\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">As natural language processing technology has yielded denser embeddings (as compared to the sparse embeddings featured in 
approaches like TF-IDF), Google has improved its ability to capture and associate information on a passage, page, site, and author level. Google moved on a long time ago, but with the rapid advancements in vector embeddings we can catch up.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-56ea124 elementor-widget elementor-widget-heading\" data-id=\"56ea124\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">What are Vector Embeddings?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0690a5d elementor-widget elementor-widget-text-editor\" data-id=\"0690a5d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">As I explained in my <a href=\"https:\/\/ipullrank.com\/content-relevance\"><span class=\"s1\">\u201cRelevance is not Qualitative Measure for Search Engines\u201d<\/span><\/a> piece, the vector space model is what powers the understanding of relevance between queries and documents. Vector embeddings are used to represent the query and the documents in that model.<\/p><p class=\"p1\">I explained this in that post, but in the spirit of saving you a click and improving the relevance for this post, vector embeddings are a powerful technique in natural language processing (NLP) that represent words, phrases, or documents by plotting them as coordinates in multi-dimensional space. 
These vectors capture the semantic meaning and relationships between words, allowing machines to understand the nuances of language.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-52d712d elementor-widget elementor-widget-image\" data-id=\"52d712d\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"366\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-1024x468.png\" class=\"attachment-large size-large wp-image-17507\" alt=\"embedding model diagram\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-1024x468.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-300x137.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-768x351.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-1536x702.png 1536w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-825x377.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram-945x432.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/embedding-model-diagram.png 1999w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bc40f6c elementor-widget elementor-widget-text-editor\" data-id=\"bc40f6c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">While TF-IDF and its variants yielded simplified word vectors that indicated the presence of words based on a given page&#8217;s vocabulary, the history of modern vector embeddings goes back a little over 
a decade. Seminal works like the aforementioned word2vec were introduced in 2013, and capabilities have rapidly improved since the advent of Google\u2019s <a href=\"https:\/\/research.google\/blog\/transformer-a-novel-neural-network-architecture-for-language-understanding\/\"><span class=\"s1\">Transformer<\/span><\/a>. These models learned word relationships by analyzing large text corpora, positioning similar words close together in the vector space. The Transformer built upon a concept called <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\"><span class=\"s1\">Attention<\/span><\/a>, wherein the language model developed the ability to also understand context and polysemy. So if I have the sentence \u201cShe bats her eyelashes flirtatiously at her date across the table\u201d and the sentence \u201cAt dusk, bats emerged from the cave, flitting about in search of insects,\u201d modern language models can understand that the second usage of the word \u201cbats\u201d is the noun for the animal while the first usage is the verb for the physical action.<\/p><p class=\"p1\">Dense vector embeddings revolutionized search by enabling semantic search (or vector search), which goes beyond keyword matching to understand the intent behind a query and the meaning of the documents being considered. Search engines can now identify synonyms and related concepts, leading to more relevant and accurate results. When we say we\u2019ve moved from keywords to concepts, this is what we\u2019re talking about. 
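The move from keywords to measurable concepts boils down to comparing vectors. A minimal sketch, using tiny made-up 4-dimensional vectors standing in for real embedding output (production embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "car" and "automobile" share no useful characters
# for lexical matching, yet their vectors point in nearly the same direction.
car        = [0.91, 0.10, 0.05, 0.30]
automobile = [0.88, 0.14, 0.09, 0.33]
banana     = [0.02, 0.95, 0.30, 0.01]

print(cosine_similarity(car, automobile))  # high (near 1.0)
print(cosine_similarity(car, banana))      # low
```

Cosine similarity is the standard comparison used in nearest-neighbor search over embeddings, which is exactly the kind of scoring you can run on crawl data once pages are vectorized.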
Information Retrieval is no longer solely reliant on the presence of specific words; a <i>concept<\/i> can be represented and measured.<\/p><p class=\"p1\">Further still, vectors allow Google to effectively model representations of <a href=\"https:\/\/research.google\/pubs\/multi-aspect-dense-retrieval\/\"><span class=\"s1\">queries<\/span><\/a>, <a href=\"https:\/\/arxiv.org\/abs\/1909.10506\"><span class=\"s1\">entities<\/span><\/a>, individual sentences, <a href=\"https:\/\/patents.google.com\/patent\/US11275895B1\/en\"><span class=\"s1\">authors<\/span><\/a>, and <a href=\"https:\/\/gofishdigital.com\/blog\/website-representation-vectors\/\"><span class=\"s1\">websites<\/span><\/a>, and to use those representations to fulfill the ideas behind <a href=\"https:\/\/ipullrank.com\/why-e-a-t-core-updates-will-change-your-content-approach\">E-E-A-T<\/a>.<\/p><p class=\"p1\">State-of-the-art vector embeddings are trained on massive datasets and incorporate contextual information to capture complex relationships between words. 
This ongoing development continues to improve the accuracy and efficiency of search algorithms.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e09141f elementor-widget elementor-widget-heading\" data-id=\"e09141f\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Enter Screaming Frog SEO Spider\u2019s Custom JavaScript<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af909f8 elementor-widget elementor-widget-text-editor\" data-id=\"af909f8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\"><span class=\"s1\"><a href=\"https:\/\/www.screamingfrog.co.uk\/seo-spider\/\">Screaming Frog SEO Spider<\/a><\/span> has been nudging the SEO space forward for over a decade. The team has continued to innovate in ways that the industry needs and that the SaaS tools are too slow to match. Being in the cloud has its advantages for scalability and speed, but there is nothing any of those tools can do that SFSS can&#8217;t, and plenty that it can do that those other tools won&#8217;t. When the SF team launches cutting-edge features, other tools should consider altering their roadmaps.<\/p><p class=\"p1\">Version 20 is no different. What&#8217;s great for my cause is that it seems they are in agreement that someone should help us vectorize the web.<\/p><p class=\"p1\">Their new Custom JavaScript functionality allows users to run bespoke JS functions on pages as the spider crawls them. 
You can now do customized analysis and make calls to third party sources to enhance your crawl data.\u00a0<\/p><p class=\"p1\">While a lot of SEOs are going to use this upgrade to turn SFSS into Scrapebox, one of the operations that comes with the tool is code to generate vector embeddings from OpenAI as you crawl. This is specifically what is going to help you make the upgrade from lexical analysis to semantic.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-afc0c8b elementor-widget elementor-widget-heading\" data-id=\"afc0c8b\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">JavaScript Functions in Screaming Frog<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cac4df elementor-widget elementor-widget-text-editor\" data-id=\"8cac4df\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">I was going to explain the different functions for accessing things in SFSS, but instead I made you a <a href=\"https:\/\/chatgpt.com\/g\/g-JvVDxK5wj-kermit\"><span class=\"s1\">custom GPT called Kermit<\/span><\/a> with <a href=\"https:\/\/www.screamingfrog.co.uk\/seo-spider\/user-guide\/configuration\/#custom-javascript\"><span class=\"s1\">the documentation<\/span><\/a> to help you write your code. 
<i>Thank me later.<\/i><\/p><p class=\"p1\">The main functional capabilities to know are that you can:<\/p><ul class=\"ul1\"><li class=\"li1\">run actions on the page<\/li><li class=\"li1\">run extractions from the page<\/li><li class=\"li1\">save files based on either of the above<\/li><li class=\"li1\">load external scripts<\/li><li class=\"li1\">run multiple operations<\/li><li class=\"li1\">perform operations from the <a href=\"https:\/\/developer.chrome.com\/docs\/devtools\/console\/utilities\"><span class=\"s1\">Chrome Utilities API<\/span><\/a><\/li><\/ul><p class=\"p1\">Unless you\u2019re a JavaScript wizard, I recommend leaning heavily on the custom GPT to get yourself started. I also recommend contributing what you build to this public repository of SFSS custom JS scripts.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-98a7944 elementor-widget elementor-widget-heading\" data-id=\"98a7944\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">How to Vectorize a Site with Screaming Frog SEO Spider<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6509858 elementor-widget elementor-widget-text-editor\" data-id=\"6509858\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">If you&#8217;ve used custom extractions in SFSS before, the new Custom JS extraction functionality is an expansion of that. You can define what you want to execute at runtime and store it in unique columns. To get you started, the SF team has prepared a series of templates, including one for capturing embeddings from OpenAI as you crawl. 
Here\u2019s how you do it:<\/p><ol class=\"ol1\"><li class=\"li1\">In the Crawl Config select Custom &gt; Custom JavaScript<\/li><\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3bb7429 elementor-widget elementor-widget-image\" data-id=\"3bb7429\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"494\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-1024x632.png\" class=\"attachment-large size-large wp-image-17508\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-1024x632.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-300x185.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-768x474.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-825x509.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript-945x583.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-crawl-config-custom-javascript.png 1304w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cad53f5 elementor-widget elementor-widget-text-editor\" data-id=\"cad53f5\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">This brings you to a dialog box where you can add and test your custom JS code. 
Click Add from Library to get the party started.<\/p>\n\n<ol class=\"ol1\" start=\"2\">\n \t<li class=\"li1\">Select the ChatGPT extract embeddings from page content<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ff5e82 elementor-widget elementor-widget-image\" data-id=\"7ff5e82\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"525\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-open-ai-extraction.png\" class=\"attachment-large size-large wp-image-17509\" alt=\"screaming frog - open ai extraction\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-open-ai-extraction.png 722w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-open-ai-extraction-300x218.png 300w\" sizes=\"(max-width: 722px) 100vw, 722px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a185083 elementor-widget elementor-widget-text-editor\" data-id=\"a185083\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">In the system tab you\u2019ll find a series of custom extractions and actions. There are things that can be pulled without external functionality as well as things that can use ChatGPT\u2019s API or a local LLM via <a href=\"https:\/\/ollama.com\/\"><span class=\"s1\">ollama<\/span><\/a>. 
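Under the hood, the embeddings snippet amounts to one POST per page to OpenAI's embeddings endpoint. Here is a rough Python equivalent of the request it builds (the actual library snippet is JavaScript; the model name shown is one of OpenAI's embeddings models, and the key is a placeholder):

```python
import json

API_URL = "https://api.openai.com/v1/embeddings"

def build_embedding_request(text, api_key, model="text-embedding-3-small"):
    """Assemble the POST request an embeddings call makes for one page."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # placeholder key, never hardcode real ones
    }
    body = json.dumps({"model": model, "input": text})
    return API_URL, headers, body

# The response JSON carries the vector under data[0]["embedding"].
url, headers, body = build_embedding_request("extracted page text", "sk-PLACEHOLDER")
print(json.loads(body)["model"])  # text-embedding-3-small
```

Seeing the request shape makes it clear why the snippet needs your API key and why each crawled URL costs tokens.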
For our purposes, you\u2019ll want the \u201c(ChatGPT) Extract embeddings from page content\u201d function.<\/p>\n\n<ol class=\"ol1\" start=\"3\">\n \t<li class=\"li1\">Enter your OpenAI API key.<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14f7989 elementor-widget elementor-widget-image\" data-id=\"14f7989\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"566\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-1024x724.png\" class=\"attachment-large size-large wp-image-17510\" alt=\"Screaming Frog - Custom JavaScript Snippet Editor - OpenAI API Key\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-1024x724.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-300x212.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-768x543.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-825x583.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key-945x668.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Screaming-Frog-Custom-JavaScript-Snippet-Editor-OpenAI-API-Key.png 1122w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8e31765 elementor-widget elementor-widget-text-editor\" data-id=\"8e31765\" data-element_type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Once you\u2019ve selected the extraction, you\u2019ll need to configure the code by adding your <a href=\"https:\/\/help.openai.com\/en\/articles\/4936850-where-do-i-find-my-openai-api-key\"><span class=\"s1\">OpenAI API key<\/span><\/a>. You can test it on the right by adding a URL and clicking test. What you\u2019ll get back is a series of decimal numbers. These are your embeddings.<\/p>\n\n<ol class=\"ol1\" start=\"4\">\n \t<li class=\"li1\">Configure your crawl as you normally would and make sure to enable JavaScript rendering and that crawling external links is enabled. Then let it run as normal.<\/li>\n<\/ol>\n<p class=\"p1\">When you get your data back your Custom JavaScript tab will look like this:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7aae18a elementor-widget elementor-widget-image\" data-id=\"7aae18a\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"427\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-1024x547.png\" class=\"attachment-large size-large wp-image-17511\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-1024x547.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-300x160.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-768x411.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-825x441.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings-945x505.png 945w, 
https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/screaming-frog-openai-embeddings.png 1083w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7cd84a8 elementor-widget elementor-widget-text-editor\" data-id=\"7cd84a8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">By default, the embeddings will only be computed on pages that are of text\/html type, but embeddings can be multimodal, so if you wanted to compute them on images you could. For that you\u2019d have to adjust the Content Types that the JS fires on and pass the images as bytes.<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cf05a0 elementor-widget elementor-widget-heading\" data-id=\"8cf05a0\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Do I Have to Use OpenAI?<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3b548e6 elementor-widget elementor-widget-text-editor\" data-id=\"3b548e6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">No, you don\u2019t. In fact, according to the <a href=\"https:\/\/huggingface.co\/spaces\/mteb\/leaderboard\"><span class=\"s1\">HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard<\/span><\/a>, they are not considered state of the art at this point. 
Google\u2019s text-embedding-preview-0409 embeddings model is smaller with lower dimensionality and outperforms OpenAI\u2019s <code>text-embedding-3-large<\/code> embeddings model in all tasks.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a61dcec elementor-widget elementor-widget-image\" data-id=\"a61dcec\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"427\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-1024x546.png\" class=\"attachment-large size-large wp-image-17512\" alt=\"hugging face embedding benchmarks\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-1024x546.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-300x160.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-768x409.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-1536x818.png 1536w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-825x440.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks-945x503.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hugging-face-embedding-benchmarks.png 1708w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8eb46f7 elementor-widget elementor-widget-text-editor\" data-id=\"8eb46f7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-weight: 400;\">So, if we want closer parity to 
what Google is using in production, then we\u2019d want to use the <\/span><a href=\"https:\/\/cloud.google.com\/blog\/products\/ai-machine-learning\/google-cloud-announces-new-text-embedding-models\"><span style=\"font-weight: 400;\">embeddings models in their Vertex AI<\/span><\/a><span style=\"font-weight: 400;\"> offering.\u00a0<\/span>\n\n<span style=\"font-weight: 400;\">You should know that Google\u2019s embeddings are a fraction of a cent more expensive than OpenAI\u2019s. OpenAI\u2019s text-embedding-3-small is $0.00002 \/ 1K tokens while Google\u2019s <\/span><code>text-embedding-preview-0409<\/code><span style=\"font-weight: 400;\"> is $0.000025 \/ 1K tokens. If you did 100 million tokens (or the equivalent of two thousand novels not written by Stephen King), you\u2019re spending $2 with OpenAI and $2.50 with Google. However, if you do batch requests to Google, the pricing is exactly the same. <\/span><i><span style=\"font-weight: 400;\">Although, I wonder if this will change based on announcements at the upcoming Google I\/O conference.<\/span><\/i>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5fd2aad elementor-widget elementor-widget-heading\" data-id=\"5fd2aad\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Accounting for Token Limits<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36d842a elementor-widget elementor-widget-text-editor\" data-id=\"36d842a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">The length of your content is also a factor since Google only accepts 3,071 input tokens, whereas OpenAI accepts 8,191. 
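Before sending a page to either API, you can roughly gauge whether it will fit. A common heuristic is ~4 characters per token for English text (this is an approximation only; exact counts require the provider's own tokenizer, such as tiktoken for OpenAI). A minimal sketch, with the limits above and hypothetical filler content:

```python
# Rough pre-flight check before sending page text to an embeddings API.
# The ~4 characters-per-token ratio is only an approximation for English;
# exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI).

OPENAI_LIMIT = 8191   # input token limit noted above
GOOGLE_LIMIT = 3071

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def needs_chunking(text: str, limit: int) -> bool:
    return estimate_tokens(text) > limit

page_text = "word " * 20000  # ~100k characters of hypothetical page copy
print(needs_chunking(page_text, OPENAI_LIMIT))   # long pages exceed the limit
print(needs_chunking("a short paragraph of copy", GOOGLE_LIMIT))
```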
If your content is too long, you\u2019ll get an error message that looks like this:<\/span><\/p><p>\u00a0<\/p><pre><code class=\"language-\">Error: {\n\u00a0\u00a0\"error\": {\n\u00a0\u00a0\u00a0\u00a0\"message\": \"This model's maximum context length is 8192 tokens, however you requested 11738 tokens (11738 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.\",\n\u00a0\u00a0\u00a0\u00a0\"type\": \"invalid_request_error\",\n\u00a0\u00a0\u00a0\u00a0\"param\": null,\n\u00a0\u00a0\u00a0\u00a0\"code\": null\n\u00a0\u00a0}\n}<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af89fb1 elementor-widget elementor-widget-text-editor\" data-id=\"af89fb1\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">In these cases you\u2019d have to chunk the content and manage the embeddings into a single set for our use cases. 
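Combining per-chunk embeddings into one page-level vector usually means element-wise averaging. A minimal numpy sketch with small hypothetical vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Hypothetical per-chunk embeddings for one page.
chunk_embeddings = [
    np.array([0.2, 0.4, 0.6]),
    np.array([0.4, 0.6, 0.8]),
]

# Element-wise mean collapses N chunk vectors into a single page vector.
page_embedding = np.vstack(chunk_embeddings).mean(axis=0)
print(page_embedding)  # -> [0.3 0.5 0.7]
```

A variant is to weight each chunk by its token count so short trailing chunks don't pull the average disproportionately.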
It\u2019s common practice to <a href=\"https:\/\/randorithms.com\/2020\/11\/17\/Adding-Embeddings.html\"><span class=\"s1\">average the embeddings<\/span><\/a> from the chunks into a single set of embeddings as follows:<\/p>\n&nbsp;\n<pre><code class=\"language-\">const OPENAI_API_KEY = 'your_api_key_here';\nconst userContent = document.body.innerText;\n\nfunction chatGptRequest() {\n    if (new TextEncoder().encode(userContent).length > 8191) { \/\/ Checking byte length approximation for tokens\n        \/\/ Function to break the string into chunks\n        function chunkString(str, size) {\n            const numChunks = Math.ceil(str.length \/ size);\n            const chunks = new Array(numChunks);\n\n            for (let i = 0, o = 0; i < numChunks; ++i, o += size) {\n                chunks[i] = str.substring(o, o + size);\n            }\n            return chunks;\n        }\n\n        \/\/ Divide content into manageable chunks\n        const chunks = chunkString(userContent, 8191);\n\n        \/\/ Function to request batch embeddings for all chunks\n        function chatGptBatchRequest(chunks) {\n            return fetch('https:\/\/api.openai.com\/v1\/embeddings', {\n                method: 'POST',\n                headers: {\n                    'Authorization': `Bearer ${OPENAI_API_KEY}`,\n                    \"Content-Type\": \"application\/json\",\n                },\n                body: JSON.stringify({\n                    model: \"text-embedding-3-small\",\n                    input: chunks,\n                    encoding_format: \"float\",\n                })\n            })\n            .then(response => {\n                if (!response.ok) {\n                    return response.text().then(text => { throw new Error(text); });\n                }\n                return response.json();\n            })\n            .then(data => {\n                if (data.data.length > 0) {\n                    const numEmbeddings = data.data.length;\n               
     const embeddingLength = data.data[0].embedding.length;\n                    const sumEmbedding = new Array(embeddingLength).fill(0);\n\n                    data.data.forEach(embed => {\n                        embed.embedding.forEach((value, index) => {\n                            sumEmbedding[index] += value;\n                        });\n                    });\n\n                    const averageEmbedding = sumEmbedding.map(sum => sum \/ numEmbeddings);\n                    return averageEmbedding.toString();\n                } else {\n                    throw new Error(\"No embeddings returned from the API.\");\n                }\n            });\n        }\n\n        \/\/ Make a single batch request with all chunks and process the average\n        return chatGptBatchRequest(chunks);\n    } else {\n        \/\/ Process single embedding request if content is within the token limit\n        return fetch('https:\/\/api.openai.com\/v1\/embeddings', {\n            method: 'POST',\n            headers: {\n                'Authorization': `Bearer ${OPENAI_API_KEY}`,\n                \"Content-Type\": \"application\/json\",\n            },\n            body: JSON.stringify({\n                model: \"text-embedding-3-small\",\n                input: userContent,\n                encoding_format: \"float\",\n            })\n        })\n        .then(response => {\n            if (!response.ok) {\n                 return response.text().then(text => {throw new Error(text)});\n            }\n            return response.json();\n        })\n        .then(data => {\n            console.log(data.data[0].embedding);\n            return data.data[0].embedding.toString();\n        });\n    }\n}\n\n\/\/ Execute request and handle results\nreturn chatGptRequest()\n    .then(embeddings => seoSpider.data(embeddings))\n    .catch(error => seoSpider.error(error));\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-e72a879 elementor-widget elementor-widget-text-editor\" data-id=\"e72a879\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Using this code, you won\u2019t have to worry about the input length error, though the 8.1k input token limit is enough for most pages anyway.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-465575f elementor-widget elementor-widget-heading\" data-id=\"465575f\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Using Google\u2019s Embeddings<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-474af2e elementor-widget elementor-widget-text-editor\" data-id=\"474af2e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Google\u2019s REST APIs require OAuth (the annoying pop-up window for authentication), so it&#8217;s not as simple as making an HTTP request to an endpoint with an API key like with OpenAI. Since SFSS does not support OAuth for Custom JS (nor does it need to), you&#8217;d have to stand up some middleware between it and the Vertex AI API. 
What I do is set up a local server with an API that makes the request to Vertex AI.<\/p>\n<p class=\"p1\">Here&#8217;s the code to do so using Flask:<\/p>\n\n<pre><code class=\"language-\">import logging\nimport sys\nimport os\nfrom flask import Flask, request, jsonify\nfrom google.auth import load_credentials_from_file\nimport tiktoken\nimport numpy as np\n\nfrom google.cloud import aiplatform\nfrom google.oauth2 import service_account\nfrom typing import List, Optional\nfrom vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel\nimport vertexai.preview\n\napp = Flask(__name__)\n\n\nos.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = '[insert path to service account file here]'\n\ndef authenticate():\n    \"\"\"Load credentials from the environment variable.\"\"\"\n    credentials, project = load_credentials_from_file(os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"])\n    return credentials\n\ncredentials = authenticate()\n\ndef token_count(string: str, encoding_name: str) -> int:\n    encoding = tiktoken.get_encoding(encoding_name)\n    num_tokens = len(encoding.encode(string))\n    return num_tokens\n\ndef split_text(text: str, max_tokens: int = 3000) -> List[str]:\n    # Note: counts words as a rough proxy for tokens\n    words = text.split()\n    chunks = []\n    current_chunk = []\n    for word in words:\n        if len(current_chunk) + 1 > max_tokens:\n            chunks.append(' '.join(current_chunk))\n            current_chunk = []\n        current_chunk.append(word)\n    if current_chunk:\n        chunks.append(' '.join(current_chunk))\n    return chunks\n\ndef embed_text(text: str, task: str = \"RETRIEVAL_DOCUMENT\",\n               model_name: str = \"text-embedding-preview-0409\", dimensionality: Optional[int] = 256) -> List[float]:\n    model = TextEmbeddingModel.from_pretrained(model_name)\n    text_chunks = split_text(text)\n    inputs = [TextEmbeddingInput(chunk, task) for chunk in text_chunks]\n    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}\n 
   chunk_embeddings = [model.get_embeddings([input], **kwargs) for input in inputs]\n    embeddings = [embedding.values for sublist in chunk_embeddings for embedding in sublist]\n    average_embedding = np.mean(embeddings, axis=0)\n    return average_embedding.tolist()\n\n@app.route('\/embed', methods=['POST'])\ndef handle_embed():\n    data = request.json\n    if not data or 'text' not in data or 'task' not in data:\n        return jsonify({\"error\": \"Request must contain 'text' and 'task' fields\"}), 400\n    text = data['text']\n    task = data['task']\n    try:\n        embedding = embed_text(text, task)\n        return jsonify({\"embedding\": embedding})\n    except Exception as e:\n        return jsonify({\"error\": str(e)}), 500\n\nif __name__ == '__main__':\n    app.run(debug=True, host='0.0.0.0', port=5000)\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d80002 elementor-widget elementor-widget-text-editor\" data-id=\"3d80002\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">To get this working, you&#8217;ll need to install the dependencies, <a href=\"https:\/\/cloud.google.com\/vertex-ai\/docs\/start\/cloud-environment#enable_vertexai_apis\"><span class=\"s1\">enable Vertex AI<\/span><\/a>, and <a href=\"https:\/\/cloud.google.com\/iam\/docs\/keys-create-delete\"><span class=\"s1\">get a service key<\/span><\/a>. 
Here is the pip install one liner for the dependencies:<\/p><pre><code class=\"language-\">pip install flask numpy google-cloud-aiplatform google-auth tiktoken vertexai<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4d409c elementor-widget elementor-widget-text-editor\" data-id=\"d4d409c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Once you have the server running, you can setup a custom JS extraction to pull the data as follows:<\/p><pre><code class=\"language-\">const userContent = document.body.innerText;\n\nfunction vertextAiRequest() {\n    return fetch('http:\/\/127.0.0.1:5000\/embed', {\n        method: 'POST',\n        headers: {\n            \"Content-Type\": \"application\/json\",\n        },\n        body: JSON.stringify({\n            task: \"RETRIEVAL_DOCUMENT\",\n            text: `${userContent}`\n            })\n    })\n    .then(response =&gt; {\n        if (!response.ok) {\n             return response.text().then(text =&gt; {throw new Error(text)});\n        }\n        return response.json();\n    })\n    .then(data =&gt; {\n        return data.embedding.toString();\n    });\n}\n\nreturn vertextAiRequest()\n    .then(embeddings =&gt; seoSpider.data(embeddings))\n    .catch(error =&gt; seoSpider.error(error));<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7596c5a elementor-widget elementor-widget-text-editor\" data-id=\"7596c5a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Here\u2019s what the Vertex AI embeddings output will look like:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5830773 elementor-widget elementor-widget-image\" 
data-id=\"5830773\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"566\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-1024x724.png\" class=\"attachment-large size-large wp-image-17515\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-1024x724.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-300x212.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-768x543.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-825x583.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request-945x668.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/vertex-request.png 1122w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe6ed5d elementor-widget elementor-widget-heading\" data-id=\"fe6ed5d\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">What About Open Source options?<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-241d626 elementor-widget elementor-widget-text-editor\" data-id=\"241d626\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">This is all quite inexpensive, but you could also set up an embedding server locally via ollama and generate your embeddings for free using one of the open source pretrained models. 
For example, if you wanted to use the highly-rated <code>SFR-embedding-mistral<\/code> embeddings model in the same way, follow these steps:<\/p><ol class=\"ol1\"><li class=\"li3\"><span class=\"s1\"><a href=\"https:\/\/ollama.com\/download\"><span class=\"s2\">Download, install, and start ollama<\/span><\/a>\u00a0<\/span><\/li><li class=\"li1\">Confirm that it\u2019s running by going to http:\/\/localhost:11434<\/li><li class=\"li1\">At the command line, download and run the model with <code>ollama run avr\/sfr-embedding-mistral<\/code>. To verify things are working properly, you can run a query in Postman or with cURL.<\/li><\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1f3a4f4 elementor-widget elementor-widget-image\" data-id=\"1f3a4f4\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"736\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-query.png\" class=\"attachment-large size-large wp-image-17514\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-query.png 874w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-query-300x276.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-query-768x706.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-query-825x759.png 825w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2e9eff5 elementor-widget elementor-widget-text-editor\" data-id=\"2e9eff5\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol start=\"4\">\n \t<li class=\"p1\">Once you\u2019ve confirmed it\u2019s running you 
can use this code as a custom extraction for generating the embeddings.<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a3753e elementor-widget elementor-widget-image\" data-id=\"4a3753e\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"566\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-1024x724.png\" class=\"attachment-large size-large wp-image-17513\" alt=\"Vertex AI Output\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-1024x724.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-300x212.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-768x543.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-825x583.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output-945x668.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Vertex-AI-Output.png 1122w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e16e7a4 elementor-widget elementor-widget-text-editor\" data-id=\"e16e7a4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Here\u2019s the code:<\/p><pre><code class=\"language-\">const userContent = document.body.innerText;\n\nfunction getEmbeddings(userContent) {\n    const apiUrl = 'http:\/\/localhost:11434\/api\/embeddings';\n\n    const postData = {\n        \"model\": \"avr\/sfr-embedding-mistral\",\n        \"prompt\": userContent,\n    };\n\n    const fetchOptions = {\n        method: 'POST',\n        headers: {\n         
   'Content-Type': 'application\/json'\n        },\n        body: JSON.stringify(postData)\n    };\n\n    return fetch(apiUrl, fetchOptions)\n        .then(response =&gt; {\n            if (!response.ok) {\n                return response.text().then(text =&gt; {throw new Error(text)});\n            }\n            return response.json();\n        })\n        .then(data =&gt; {\n            return data.embedding;\n        });\n}\n\nreturn getEmbeddings(userContent)\n  .then(embeddings =&gt; seoSpider.data(embeddings))\n  .catch(error =&gt; seoSpider.error(error));\n  \n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-060a822 elementor-widget elementor-widget-text-editor\" data-id=\"060a822\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Depending on the specs of your machine, ollama may slow down your crawl too much to use it for generating embeddings. It may timeout before the data is returned. Make sure to test it on a few URLs before you let your crawl go.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3e3f5b3 elementor-widget elementor-widget-heading\" data-id=\"3e3f5b3\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Preparing Your Embeddings for Analysis<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e253a4a elementor-widget elementor-widget-text-editor\" data-id=\"e253a4a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Before we get into use cases, you need to know how to prepare the data for analysis. 
Embeddings are stored in SFSS as comma-separated strings, but they are numerical arrays that need to be converted back into floats for analysis. I prefer to export the XLSX file rather than a CSV because too many tools have done me dirty when it comes to formatting. I don&#8217;t want any potential formatting issues to damage my hard-won data. However, my testing has shown that CSVs can work just fine here too.<\/p>\n<p class=\"p1\">Nevertheless, the conversion is simple with numpy. Here&#8217;s a function to make it happen after you load your file into a dataframe:<\/p>\n<pre><code class=\"language-\">def convert_strings_to_float(df, col, new_col_name):\n  df = df[df[col].isna() == False]\n  df[new_col_name] = df[col].str.split(',')\n  df[new_col_name] = df[new_col_name].apply(lambda x: np.float64(x))\n  df['EmbeddingLength'] = df[new_col_name].apply(lambda x: x.size)\n\n  return df\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bd2844d elementor-widget elementor-widget-text-editor\" data-id=\"bd2844d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Now you have your embeddings in a dataframe, ready to use for analysis.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-89a8307 elementor-widget elementor-widget-heading\" data-id=\"89a8307\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Indexing Your Vector Embeddings<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-aa214cb elementor-widget elementor-widget-text-editor\" data-id=\"aa214cb\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">The first thing we want to do is build an index of the vectors so we can search them for various use cases. For vector searching, we\u2019ll use Google\u2019s SCaNN package. Here\u2019s the code:<\/p>\n<pre><code class=\"language-\">def scann_search(dataset:np.ndarray, queries: np.ndarray, n_neighbors = 10, distance_measure = \"dot_product\", num_leaves = 2000, num_leaves_to_search = 100):\n  normalized_dataset = dataset \/ np.linalg.norm(dataset, axis=1)[:, np.newaxis]\n\n  searcher = scann.scann_ops_pybind.builder(normalized_dataset, n_neighbors, distance_measure).tree(\n      num_leaves=num_leaves, num_leaves_to_search=num_leaves_to_search, training_sample_size=250000).score_ah(\n      2, anisotropic_quantization_threshold=0.2).reorder(100).build()\n\n  return searcher\n\ndef convert_scann_arrays_to_urls(arrays: np.array, df: pd.DataFrame,column):\n    results = []\n    for arr in arrays:\n      results.append(df.iloc[arr.flatten()][column].tolist())\n    return results\n\nsiteDf = siteDf[siteDf['openAiEmbeddings'].isna() == False]\nsiteDf['openAiEmbeddingsAsFloats'] = siteDf['openAiEmbeddings'].str.split(',')\nsiteDf['openAiEmbeddingsAsFloats'] = siteDf['openAiEmbeddingsAsFloats'].apply(lambda x: np.float64(x))\nsiteDf['EmbeddingLength'] = siteDf['openAiEmbeddingsAsFloats'].apply(lambda x: x.size)\n\nif siteDf['EmbeddingLength'].unique().size == 1:\n  d = siteDf['EmbeddingLength'].unique() #Number of dimensions for each value\nelse:\n  print('Dimensionality reduction required to make all arrays the same size.')\n\ndataset = np.vstack(siteDf['openAiEmbeddingsAsFloats'].values)\nqueries = dataset\n\nsiteSearcher = scann_search(dataset, queries)\nsiteSearcher.serialize(index_directory+'\/site_scann_index')\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b39d1af elementor-widget elementor-widget-text-editor\" 
data-id=\"b39d1af\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">I&#8217;m using SCaNN, but you could use another package like Facebook\u2019s <a href=\"https:\/\/github.com\/facebookresearch\/faiss\"><span class=\"s1\">FAISS<\/span><\/a> or Spotify\u2019s <a href=\"https:\/\/github.com\/spotify\/annoy\"><span class=\"s1\">Annoy<\/span><\/a>.<\/p><p class=\"p1\">Note: If you don\u2019t want to do this with Python, you could also push the data to BigQuery and use <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/vector-search-intro\"><span class=\"s1\">its engine for vector searches<\/span><\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e86b520 elementor-widget elementor-widget-heading\" data-id=\"e86b520\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Vectorizing your Keyword List<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3cc8941 elementor-widget elementor-widget-text-editor\" data-id=\"3cc8941\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">In the vector space model, vectors for queries are compared to vectors for documents to determine what are the most relevant documents for a user\u2019s search. So, for much of your comparative analysis, you will want to vectorize your list of keywords to compare against with nearest neighbor searches and other operations. You can use similar code on a CSV of keywords with their landing pages. 
We\u2019ll want to maintain the landing pages so we can compare against the pages that are considered the most relevant.<\/p><p class=\"p1\">Here is the approach to doing it with OpenAI using an export of keyword data from Semrush:<\/p><pre><code class=\"language-\"># Function to get embeddings and flatten them for SCANN\ndef get_openai_embeddings(keyword):\n    response = openai.embeddings.create(\n        input=keyword,\n        model=\"text-embedding-3-small\"  # Make sure to use the same embeddings as Screaming Frog\n    )\n    # Extract and flatten the embedding\n    embedding_vector = response.data[0].embedding\n    return np.array(embedding_vector).flatten()\n\nsemrushFile = 'ipullrank.com-organic.Positions-us-20220415-2024-05-05T16_03_31Z.csv'\nkeywordDf = read_file(semrushFile, 'CSV')\ndisplay(keywordDf)\n\n# Loop through the DataFrame and get embeddings for each keyword\nembeddings = []\nfor keyword in keywordDf['Keyword']:\n    embeddings.append(get_openai_embeddings(keyword))\n\nkeywordDf['embeddings'] = embeddings\n\n# Create a temporary DataFrame for Excel output with embeddings converted to strings\ntempDf = keywordDf.copy()\ntempDf['embeddings'] = tempDf['embeddings'].apply(lambda x: str(x))\ntempDf.to_excel('semrush-embeddings.xlsx', index=False) # Save with embeddings as strings\n\n# Display the updated DataFrame\nprint(keywordDf.head())\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6b65e95 elementor-widget elementor-widget-text-editor\" data-id=\"6b65e95\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">If you&#8217;re using Google\u2019s embeddings they make a specific distinction between document and query embeddings. So, the code we used earlier will require the <code>\u201cRETRIEVAL_QUERY\u201d<\/code> task type to be specified. 
The only change that we make is calling the <code>embed_text()<\/code> function with the task variable set to <code>RETRIEVAL_QUERY<\/code>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-be02b02 elementor-widget elementor-widget-image\" data-id=\"be02b02\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"740\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-embeddings.png\" class=\"attachment-large size-large wp-image-17516\" alt=\"postman embeddings\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-embeddings.png 867w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-embeddings-300x278.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-embeddings-768x710.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/postman-embeddings-825x763.png 825w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3fa67c6 elementor-widget elementor-widget-text-editor\" data-id=\"3fa67c6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">Here\u2019s the adjustment to the code to make that happen:<\/span><\/p><pre><code class=\"language-\"># Loop through the DataFrame and get embeddings for each keyword\nembeddings = []\nfor keyword in keywordDf['Keyword']:\n    <strong>embeddings.append(embed_text(keyword, \"RETRIEVAL_QUERY\"))\n<\/strong><\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bcb7713 elementor-widget elementor-widget-text-editor\" data-id=\"bcb7713\" 
data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Now let\u2019s create a SCaNN index of the keyword list:<\/p>\n\n<pre><code class=\"language-\">embeddings_matrix = np.vstack(keywordDf['embeddings'])\n\nkeywordSearcher = scann.scann_ops_pybind.builder(embeddings_matrix, 10, \"dot_product\").tree(\n    num_leaves=200, num_leaves_to_search=100, training_sample_size=250000\n).score_ah(2, anisotropic_quantization_threshold=0.2).reorder(100).build()\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-da3bfc4 elementor-widget elementor-widget-text-editor\" data-id=\"da3bfc4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">In case you&#8217;re wondering, you should not compare embeddings from different sources because they are not the same length nor are they composed by the same language model. Be consistent with the embeddings model that you use across your analysis.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35b2ece elementor-widget elementor-widget-heading\" data-id=\"35b2ece\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">SEO Use Cases for Vectorized Crawls<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2b0680 elementor-widget elementor-widget-text-editor\" data-id=\"d2b0680\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Ok, now we can unlock some new capabilities that can enhance the level of analysis we can do. 
Typically, machine learning engineers use embeddings to do a variety of things, including:<\/p><ul class=\"ul1\"><li class=\"li1\"><b>Clustering &#8211;\u00a0 <\/b>Clustering is the process of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.<\/li><li class=\"li1\"><b>Classification &#8211; <\/b>Classification involves assigning categories to objects based on input features, using trained models to predict the category for new, unseen data. I\u2019m not going to cover classification today because that is worth its own post and I want to collect more spam and helpful content data and show you how to build an embeddings-based classifier.<\/li><li class=\"li1\"><b>Recommendations &#8211;<\/b> Recommendation systems suggest relevant items to users based on their preferences and past behavior.<\/li><li class=\"li1\"><b>Similarity and Diversity Measurement &#8211;<\/b> This involves assessing how similar or different objects are from each other, often used in systems that need to understand variations or patterns among data points.<\/li><li class=\"li1\"><b>Anomaly Detection &#8211;<\/b> Anomaly detection identifies rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.<\/li><li class=\"li1\"><b>Information Retrieval &#8211;<\/b> Information retrieval is the process of obtaining information that satisfies an information need from within large collections of resources.<\/li><li class=\"li1\"><b>Machine Translation &#8211;<\/b> Machine translation automatically translates text from one language to another, using complex models to understand and convert languages.<\/li><li class=\"li1\"><b>Text Generation &#8211;<\/b> Text generation is the process of automatically producing text, often mimicking human-like writing, using various algorithms and statistical techniques.<\/li><\/ul><p class=\"p1\">This new Screaming Frog feature unlocks your ability to apply these techniques to drive deep insights for SEO.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-12812b5 elementor-widget elementor-widget-heading\" data-id=\"12812b5\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Keyword Mapping<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31fb457 elementor-widget elementor-widget-text-editor\" data-id=\"31fb457\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">The keyword-to-keyword and keyword-to-page relationships are the most important aspects we can directly impact as content creators and SEOs. Optimizing pages to improve their keyword targeting through copy adjustments and linking strategies is best informed by determining which page owns which keywords. In some cases, due to the array of ranking factors, you\u2019ll find that what ranks for the keyword is not the best page on your site. To remedy that at scale, you can loop through your keyword vector embeddings and perform nearest neighbor searches on your document SCaNN index. Wherever the highest-ranking URL does not match the current landing page, that\u2019s a linking opportunity for optimization.<\/p><p class=\"p1\">To do that, we perform the search, then add the best-match URL to the dataframe along with an indication of whether it matches the current landing page. 
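The searches in this section lean on a small helper, convert_scann_arrays_to_urls, that maps SCaNN's integer neighbor indices back to URLs. For readers reconstructing the workflow, here is one plausible implementation; it is a sketch under that assumption, not necessarily the exact helper used in this post:

```python
import numpy as np
import pandas as pd

def convert_scann_arrays_to_urls(neighbors, df, url_col):
    """Map each row of neighbor indices from a SCaNN search back to the
    URLs sitting at those row positions in df[url_col]."""
    urls = df[url_col].to_numpy()
    return [[str(u) for u in urls[np.asarray(row)]] for row in neighbors]

# Example: two queries, top-2 neighbor indices each
siteDf = pd.DataFrame({"Address": ["/a", "/b", "/c"]})
print(convert_scann_arrays_to_urls(np.array([[0, 2], [1, 0]]), siteDf, "Address"))
# → [['/a', '/c'], ['/b', '/a']]
```

The key point is that SCaNN only knows row positions in the embeddings matrix, so the DataFrame used to build the index and the DataFrame used to look up URLs must share the same row order.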
Very quickly we have an understanding of where we need to improve our keyword targeting.<\/p><pre><code class=\"language-\">queries = np.vstack(keywordDf['embeddings'].values) # Stack all individual keyword embeddings vertically into a matrix\n\nkwSearcher = scann_search(dataset, queries) # dataset is the same document index as before\nneighbors, distances = kwSearcher.search_batched(queries, final_num_neighbors=1)\n\nkeywordDf['BestMatchURL'] = convert_scann_arrays_to_urls(neighbors, siteDf, 'Address')\nkeywordDf['BestMatchURL'] = keywordDf['BestMatchURL'].apply(lambda x: x[0])\n\ndisplay(keywordDf)\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-50cb1f0 elementor-widget elementor-widget-text-editor\" data-id=\"50cb1f0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">When we run this, the data tells me that our <a href=\"https:\/\/ipullrank.com\/resources\/guides-ebooks\/modern-enterprise-seo-guide\">guide to enterprise SEO<\/a> is considered more relevant for enterprise SEO queries than our <a href=\"https:\/\/ipullrank.com\/enterprise-seo\">enterprise SEO landing page<\/a>. 
Granted, our enterprise SEO page performs better, but it looks like we need to optimize our content if we want the enterprise SEO landing page to rank better.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2583dee elementor-widget elementor-widget-image\" data-id=\"2583dee\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"333\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-1024x426.png\" class=\"attachment-large size-large wp-image-17517\" alt=\"enterprise seo keywords\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-1024x426.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-300x125.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-768x319.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-1536x639.png 1536w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-825x343.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords-945x393.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/enterprise-seo-keywords.png 1683w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccebd86 elementor-widget elementor-widget-heading\" data-id=\"ccebd86\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Keyword Relevance Calculations <\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b96f194 
elementor-widget elementor-widget-text-editor\" data-id=\"b96f194\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">The measure of relevance is a function of the distance between embeddings, which can be calculated in several ways: Euclidean distance, dot product, and my preference, cosine similarity. I prefer it because of its simplicity and the ease of converting it into a score between 0 and 100. With the keyword and URL embeddings we can compare the mapped keyword to the URL to determine how relevant it is. You could also crawl competitor pages with SFSS and do the same. The comparison is simple: find the embeddings for the URL and the keyword in their respective dataframes and compute cosine similarity.<\/p><pre><code class=\"language-\"># Function to normalize embeddings\ndef normalize_embeddings(embeddings):\n    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)\n    return (embeddings \/ norms).tolist()  # Normalize and convert to list\n\n# Normalize the embeddings and convert them to lists for DataFrame storage\nkeywordDf['NormalizedEmbeddings'] = normalize_embeddings(np.vstack(keywordDf['embeddings'].values))\nsiteDf['NormalizedEmbeddings'] = normalize_embeddings(np.vstack(siteDf['OpenAI Embeddings 1ConvertedFloats'].values))\n\n# Function to calculate cosine similarity (a dot product of unit-normalized vectors is the cosine)\ndef cosine_similarity(embedding1, embedding2):\n    return np.dot(embedding1, embedding2)\n\n# Initialize a list to store cosine similarity results\ncosine_similarities = []\nrelevance_values = []\n\n# Loop through each keyword to calculate cosine similarity with its corresponding URL in siteDf\nfor index, row in keywordDf.iterrows():\n    keyword_url = row['URL']\n    keyword_embedding = row['NormalizedEmbeddings']  # This is now a list\n\n    # Find the corresponding URL in siteDf\n    if keyword_url in siteDf['Address'].values:\n        # Get the embedding for the matching 
URL, which is also stored as a list\n        url_embedding = siteDf.loc[siteDf['Address'] == keyword_url, 'NormalizedEmbeddings'].iloc[0]\n        # Convert list to numpy array for calculation\n        similarity = cosine_similarity(np.array(keyword_embedding), np.array(url_embedding))\n        relevance = similarity * 100\n    else:\n        similarity = None  # Set similarity to None if no matching URL is found\n        relevance = None\n\n    cosine_similarities.append(similarity)\n    relevance_values.append(relevance)\n\n# Store the cosine similarities in the keywordDf\nkeywordDf['CosineSimilarity'] = cosine_similarities\nkeywordDf['Relevance'] = relevance_values\n\n\n# Display or use the updated DataFrame\nprint(keywordDf[['Keyword', 'URL', 'CosineSimilarity','Relevance']])\n\nkeywordDf.to_excel('keyword-relevance.xlsx')\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-122286a elementor-widget elementor-widget-text-editor\" data-id=\"122286a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">This is what our tool <a href=\"https:\/\/ipullrank.com\/tools\/orbitwise\/\">Orbitwise<\/a> does.\u00a0<\/p><p class=\"p1\">What you&#8217;re seeing here is an indication of low middling relevance for these keywords versus these landing pages.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b9e77a1 elementor-widget elementor-widget-image\" data-id=\"b9e77a1\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"580\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-1024x743.png\" class=\"attachment-large size-large wp-image-17518\" alt=\"\" 
srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-1024x743.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-300x218.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-768x557.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-825x599.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity-945x686.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/relevance-cosine-similarity.png 1130w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e1a77bf elementor-widget elementor-widget-heading\" data-id=\"e1a77bf\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Internal Linking and Redirect Mapping<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-06b4a9d elementor-widget elementor-widget-text-editor\" data-id=\"06b4a9d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Link relevance is about parity: the stronger the relationship between the source and target URLs, the more valuable the link.<\/p>\n<p class=\"p1\">When Overstock was migrating to bedbathandbeyond.com, <a href=\"https:\/\/ipullrank.com\/how-ipullrank-would-migrate-overstock-com-to-bedbathandbeyond-com\"><span class=\"s1\">I talked about how they could map the redirects at scale using nearest neighbor searches<\/span><\/a>. 
This same concept can be applied to identifying where to build internal links.<\/p>\n<p class=\"p1\">As we have mapped our keywords to landing pages, we now have an understanding of which pages should own which keywords. Such analysis is especially useful when dealing with millions of pages versus millions of keywords. Assuming we want to build 10 links from different pages across the site, we can determine internal link sources for a given page either by running keyword searches against the document index or by running document-to-document searches against it.<\/p>\n<p class=\"p1\">The code is the same as what we did for keyword mapping; we just want more results. Let\u2019s search for 10 neighbors this time.<\/p>\n\n<pre><code class=\"language-\"># Search the document index for each keyword, returning 10 neighbors per keyword\n\nqueries = np.vstack(keywordDf['embeddings'].values) # Stack all individual keyword embeddings vertically into a matrix\n\nkwSearcher = scann_search(dataset, queries) # dataset is the same document index as before\nneighbors, distances = kwSearcher.search_batched(queries, leaves_to_search=150, final_num_neighbors=10)\n\nkeywordDf['InternalLinkSuggestions'] = convert_scann_arrays_to_urls(neighbors, siteDf, 'Address')\nkeywordDf['InternalLinkSuggestions'] = keywordDf['InternalLinkSuggestions'].apply(lambda x: x[1:]) # Drop the top match (the keyword's own landing page)\ndisplay(keywordDf)\n\n# Create a temporary DataFrame for Excel output with embeddings converted to strings\ntempDf = keywordDf.copy()\ntempDf['embeddings'] = tempDf['embeddings'].apply(lambda x: str(x))\ntempDf.to_excel('keyword-internal-link-mapping.xlsx', index=False) # Save with embeddings as strings\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-18e4a7d elementor-widget elementor-widget-text-editor\" data-id=\"18e4a7d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">For the document version, we use the document embeddings themselves as the queries against the document index.<\/p><pre><code class=\"language-\">queries = dataset # Query the index with the documents themselves\n\nsiteSearcher = scann_search(dataset, queries)\nneighbors, distances = siteSearcher.search_batched(queries, leaves_to_search=150, final_num_neighbors=10)\n\nsiteDf['PageToPageLinkMapping'] = convert_scann_arrays_to_urls(neighbors, siteDf, 'Address')\nsiteDf['PageToPageLinkMapping'] = siteDf['PageToPageLinkMapping'].apply(lambda x: x[2:]) # Trim the top results (a page is its own nearest neighbor)\ndisplay(siteDf)\nsiteDf.to_excel('page-to-page-link-mapping.xlsx')\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b99d6e2 elementor-widget elementor-widget-text-editor\" data-id=\"b99d6e2\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Here are the results for our Enterprise SEO page based on the page-to-page calculations.<\/p>\n<pre><code class=\"language-\">['https:\/\/ipullrank.com\/resources\/guides-ebooks\/modern-enterprise-seo-guide\/chapter-1', 'https:\/\/ipullrank.com\/', 'https:\/\/ipullrank.com\/seo-for-the-procurement-professional', 'https:\/\/ipullrank.com\/services\/technical-seo', 'https:\/\/ipullrank.com\/services', 'https:\/\/ipullrank.com\/author\/andrew-mcdermott\/page\/3', 'https:\/\/ipullrank.com\/resources\/guides-ebooks\/modern-enterprise-seo-guide', 'https:\/\/ipullrank.com\/11-common-enterprise-seo-problems-and-solutions']<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6b13c67 elementor-widget elementor-widget-text-editor\" data-id=\"6b13c67\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">For redirect mapping, you\u2019d crawl the old site and the new site to generate embeddings for both. Then search the new site\u2019s index with the old site\u2019s embeddings, with top k set to 1. Using this data, you can determine the redirect relationships and limit what Google might perceive as soft 404s.<\/p><pre><code class=\"language-\">migratingSiteDf = read_file('migrating-site.xlsx', 'Excel')\nmigratingSiteDf = migratingSiteDf[~migratingSiteDf['OpenAIEmbeddings'].isna()]\nmigratingSiteDf = migratingSiteDf[~migratingSiteDf['OpenAIEmbeddings'].str.contains('error')]\nmigratingSiteDf['OpenAIEmbeddingsFloats'] = migratingSiteDf['OpenAIEmbeddings'].str.split(',')\n\nmigratingSiteDf['OpenAIEmbeddingsFloats'] = migratingSiteDf['OpenAIEmbeddingsFloats'].apply(lambda x: np.array(x, dtype = float))\n\nqueries = np.vstack(migratingSiteDf['OpenAIEmbeddingsFloats'].values) # Stack the old site's embeddings vertically into a matrix\n\nmigrationSearcher = scann_search(dataset, queries) # dataset is the same as before\nneighbors, distances = migrationSearcher.search_batched(queries, leaves_to_search=150, final_num_neighbors=1)\n\nmigratingSiteDf['MigrationTargetSuggestions'] = convert_scann_arrays_to_urls(neighbors, siteDf, 'Address')\nmigratingSiteDf['MigrationTargetSuggestions'] = migratingSiteDf['MigrationTargetSuggestions'].apply(lambda x: x[0])\n\ndisplay(migratingSiteDf)\nmigratingSiteDf.to_excel('migration-recommendations.xlsx')\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-530fa70 elementor-widget elementor-widget-heading\" data-id=\"530fa70\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Link Building Target Identification<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a597c7e elementor-widget elementor-widget-text-editor\" data-id=\"a597c7e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">This is another story for another day, but I do not believe the <a href=\"https:\/\/ipullrank.com\/less-backlinks-better-rankings\">volume approach for link building<\/a> works anymore. On the back of the advancements in natural language processing, Google is better at understanding and modeling relevance parity between the source and target of links. Links built from sources that are completely irrelevant to the subject matter are invalidated in modern PageRank calculations. My hypothesis is that this is an aspect of how SpamBrain works.<\/p><p class=\"p1\">To that end, we can vectorize a list of pages we are considering for link building and compare them against the target page using cosine similarity to determine how relevant the source page of the link is.<\/p><p class=\"p1\">For this process, we\u2019d:<\/p><ol class=\"ol1\"><li class=\"li1\">Identify a series of link targets using a tool like Ahrefs, Semrush, Pitchbox, or Respona<\/li><li class=\"li1\">Crawl those pages with Screaming Frog to collect their embeddings.\u00a0<\/li><li class=\"li1\">Compare them against the embeddings for your site to get the cosine similarity.<\/li><\/ol><p class=\"p1\">That yields a table that looks like the one below. When I sort ascending, that lets me know all the URLs that are not good fits for me to get links from. 
When we look at the scores, if they are not a 0.6 or higher, they are not relevant enough to build links from.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e92e466 elementor-widget elementor-widget-image\" data-id=\"e92e466\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"440\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-1024x563.png\" class=\"attachment-large size-large wp-image-17519\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-1024x563.png 1024w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-300x165.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-768x422.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-1536x844.png 1536w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-825x453.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold-945x519.png 945w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/cosine-similarity-score-threshhold.png 1563w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8c48cdb elementor-widget elementor-widget-text-editor\" data-id=\"8c48cdb\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">Here\u2019s the code to make it happen:<\/span><\/p><pre><code class=\"language-\">import pandas as pd\nimport numpy as np\nfrom 
scipy.spatial.distance import cdist\n\n# Load the DataFrame\nlinkProspectsDf = pd.read_excel('link-prospects.xlsx')\nlinkProspectsDf = linkProspectsDf[linkProspectsDf['OpenAI Embeddings Long Inputs 1'].notna()]\nlinkProspectsDf = linkProspectsDf[~linkProspectsDf['OpenAI Embeddings Long Inputs 1'].str.contains('error|TypeError', regex=True)]\n\n# Convert the string of numbers into a list of floats\ndef convert_embeddings(embedding_str):\n    try:\n        # Split the string into a list of strings, then convert each to float\n        return np.array([float(num) for num in embedding_str.split(',')])\n    except ValueError:\n        # Return None or np.nan in case of conversion failure, which should be handled or filtered later\n        return np.nan\n\nlinkProspectsDf['OpenAIEmbeddingsFloats'] = linkProspectsDf['OpenAI Embeddings Long Inputs 1'].apply(convert_embeddings)\n\n# Remove rows where embeddings conversion failed (if any)\nlinkProspectsDf.dropna(subset=['OpenAIEmbeddingsFloats'], inplace=True)\n\n# Normalize the embeddings\nlinkProspectsDf['normalized_embeddings'] = linkProspectsDf['OpenAIEmbeddingsFloats'].apply(lambda x: x \/ np.linalg.norm(x))\nsiteDf['normalized_embeddings'] = siteDf[new_float_col].apply(lambda x: x \/ np.linalg.norm(x))\n\n# Specific URL to search for\nspecific_url = 'https:\/\/ipullrank.com\/enterprise-seo'  # Change this to your specific URL\n\n# Retrieve the normalized embedding for the specific URL\nspecific_embedding = siteDf[siteDf['Address'] == specific_url]['normalized_embeddings'].values[0]\n\n# Prepare the embeddings array from the second dataframe\nembeddings2 = np.stack(linkProspectsDf['normalized_embeddings'].values)\n\n# Calculate cosine similarity\ncosine_similarity_scores = 1 - cdist([specific_embedding], embeddings2, 'cosine')[0]\n\n# Create a dataframe to store the results\nresults = pd.DataFrame({\n    'Search Address': specific_url,\n    'Target Address': linkProspectsDf['Address'],\n    'Cosine Similarity Score': 
cosine_similarity_scores\n})\n\n# Optionally, sort the results by scores\nresults = results.sort_values(by='Cosine Similarity Score', ascending=False)\nresults.to_excel('link-prospect-relevance.xlsx')\ndisplay(results)\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1af39bd elementor-widget elementor-widget-heading\" data-id=\"1af39bd\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Clustering Content<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2e732e8 elementor-widget elementor-widget-text-editor\" data-id=\"2e732e8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">Clustering the content can help us form topical clusters and also identify anomalies where the content is not relevant to any other content on the site. As sites grow larger, it becomes more difficult to manage how often the same topics are covered. Using <a href=\"https:\/\/maartengr.github.io\/BERTopic\/getting_started\/representation\/llm.html\"><span class=\"s1\">BERTopic<\/span><\/a> with our embeddings we can build and visualize a topical map of our content. This is a great approach when you\u2019re thinking about content pruning to further reinforce your clusters.<\/p><p class=\"p1\">When we run clustering on the embeddings using BERTopic, it automatically puts the content into meaningful groups. BERTopic integrates with ChatGPT, which allows you to generate human-readable names for the topics that were modeled. 
This is a vast improvement over other topical modeling approaches that use keywords from the content as representations rather than user-friendly labels.\u00a0<\/p><p class=\"p1\">Once we run our clustering we can visualize them a few different ways. First as a clustered scatter plot:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9538953 elementor-widget elementor-widget-image\" data-id=\"9538953\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"648\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/content-topics-visualization-with-t-SNE.png\" class=\"attachment-large size-large wp-image-17520\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/content-topics-visualization-with-t-SNE.png 854w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/content-topics-visualization-with-t-SNE-300x243.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/content-topics-visualization-with-t-SNE-768x622.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/content-topics-visualization-with-t-SNE-825x669.png 825w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5162cd6 elementor-widget elementor-widget-text-editor\" data-id=\"5162cd6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">We can also quickly look at the distribution of topics in a bar chart.<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b503781 elementor-widget elementor-widget-image\" data-id=\"b503781\" data-element_type=\"widget\" 
data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/topic-probability-distribution.png\" class=\"attachment-large size-large wp-image-17521\" alt=\"topic probability distribution\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/topic-probability-distribution.png 800w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/topic-probability-distribution-300x225.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/topic-probability-distribution-768x576.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-47c98a6 elementor-widget elementor-widget-text-editor\" data-id=\"47c98a6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">And, we can cluster the topics hierarchically.\u00a0<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f54dbd4 elementor-widget elementor-widget-image\" data-id=\"f54dbd4\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"280\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering.png\" class=\"attachment-large size-large wp-image-17522\" alt=\"\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering.png 1000w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering-300x105.png 300w, 
https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering-768x269.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering-825x289.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/hierarchical-clustering-945x331.png 945w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-275228c elementor-widget elementor-widget-text-editor\" data-id=\"275228c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">You can also see how topics are related and not related to each other:\u00a0<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-79ea1af elementor-widget elementor-widget-image\" data-id=\"79ea1af\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"480\" src=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2.png\" class=\"attachment-large size-large wp-image-17523\" alt=\"intertopic distance map\" srcset=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2.png 1000w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2-300x180.png 300w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2-768x461.png 768w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2-825x495.png 825w, https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Garrett-Sussman-2-945x567.png 945w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-158b08c 
elementor-widget elementor-widget-text-editor\" data-id=\"158b08c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">To do this we\u2019ll also need the content itself so we can extract features from it to name the clusters. Capturing the content via SFSS is trivial. The code for the custom function is a one liner:<\/span><\/p><pre><code class=\"language-\">return seoSpider.data(document.body.innerText);<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-114e1b4 elementor-widget elementor-widget-text-editor\" data-id=\"114e1b4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<span style=\"font-weight: 400;\">If you did not capture the content before, that\u2019s fine we can just merge the two dataframes as you see in the code below:<\/span><span style=\"font-weight: 400;\">\n<\/span>\n<pre><code class=\"language-\">def cluster_and_visualize_content(df, embeddings_col):\n    print(\"Starting the topic modeling process for keywords...\\n\")\n\n    # Prepare data\n    df['Page Content'] = df['Page Content'].astype(str)\n    keywords = df['Page Content'].tolist()\n    embeddings = np.vstack(df[embeddings_col].tolist())  # Ensure embeddings are properly shaped\n    embeddings = normalize(embeddings)  # Normalize embeddings for cosine similarity\n\n    prompt = \"\"\"\n      I have topic that is described by the following keywords: [KEYWORDS]\n      I am attempting to categorize this topic as part of 2-4 word taxonomy label that encapsulates all the keywords.\n      Based on the above information, can you give a short taxonomy label of the topic? 
Just return the taxonomy label itself.\n      \"\"\"\n    client = openai.OpenAI(api_key=openai.api_key)\n    representation_model = OpenAI(client, model=\"gpt-3.5-turbo\", prompt=prompt,chat=True)\n    # Initialize BERTopic\n    topic_model = BERTopic(representation_model=representation_model,calculate_probabilities=True)\n\n    # Fit BERTopic\n    topics, probabilities = topic_model.fit_transform(keywords, embeddings)\n    df['topic'] = topics  # Adding topic numbers to the DataFrame\n\n    # Visualize the topics with t-SNE\n    print(\"Reducing dimensions for visualization...\")\n    tsne = TSNE(n_components=2, random_state=42, metric='euclidean')\n    reduced_embeddings = tsne.fit_transform(embeddings)\n\n    plt.figure(figsize=(10, 8))\n    plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=topics, cmap='viridis', s=50, alpha=0.6)\n    plt.colorbar()\n    plt.title('Content Topics Visualization with t-SNE')\n    plt.xlabel('t-SNE Feature 1')\n    plt.ylabel('t-SNE Feature 2')\n    plt.show()\n\n    # Probability distribution visualization\n    min_probability = 0.01\n    if any(probabilities[0] > min_probability):\n        print(\"Visualizing topic probabilities...\")\n        fig = topic_model.visualize_distribution(probabilities[0], min_probability=min_probability)\n        fig.show()\n    else:\n        print(\"No topic probabilities above the threshold to visualize.\")\n\n    # Intertopic distance map\n    print(\"Visualizing intertopic distance map...\")\n    fig = topic_model.visualize_topics()\n    fig.show()\n\n    # Hierarchical clustering\n    print(\"Visualizing hierarchical clustering...\")\n    fig = topic_model.visualize_hierarchy()\n    fig.show()\n\n    # Extract and name topics\n    df['topic_name'] = df['topic'].apply(lambda x: topic_model.get_topic(x)[0][0] if topic_model.get_topic(x) else 'No dominant topic')\n\n    # Display DataFrame with topic names\n    display(df)\n\n    # Export the DataFrame with topic labels\n    
df.to_excel('content-clusters-bertopic.xlsx', index=False)\n\npageContentDf = read_file('ipr-content.xlsx', 'Excel')\n\ncontentEmbeddingsDf = siteDf.merge(pageContentDf, on='Address', how='inner')\n\ncluster_and_visualize_content(contentEmbeddingsDf, 'OpenAI Embeddings 1ConvertedFloats')\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2984eee elementor-widget elementor-widget-heading\" data-id=\"2984eee\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The Value of a Vector Index of the Web<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1f73982 elementor-widget elementor-widget-text-editor\" data-id=\"1f73982\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">The democratization of the link graph gave us a series of measures that allowed us to understand the value of websites in the way that Google attributes authority. Granted, those metrics are only approximations of what Google may use, but they have driven the SEO space for nearly two decades.<\/p><p class=\"p1\">And, that was enough prior to Google Search\u2019s transition to becoming a heavily machine learning-driven environment. In a hybrid fusion environment, the link graph matters less because Google is taking signals derived from vector embeddings and using them to inform ranking.\u00a0<\/p><p class=\"p1\">Dare I say, the link graph and link indices are less valuable than they were in the past. 
Yet all of the above could be native functionality for link indices, making them more valuable for doing SEO moving forward.<\/p><p class=\"p1\">Until someone gives us such an index, Screaming Frog has armed us with what we need to catch up to Google.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0606dd4 elementor-widget elementor-widget-heading\" data-id=\"0606dd4\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">So, What Are Your Use Cases?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-15f3596 elementor-widget elementor-widget-text-editor\" data-id=\"15f3596\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"p1\">The shortcomings of SEO software have yielded a strong community of Python SEOs. People have been leveraging state-of-the-art technologies to bridge the chasm between what SEO software can do and what Google actually does.<\/p><p class=\"p1\">So, I\u2019m curious: what are your use cases for vector embeddings? How do you anticipate that Screaming Frog\u2019s new feature will help you do your job even better? In the meantime, you can play with all the code I shared in <a href=\"https:\/\/colab.research.google.com\/drive\/1Na7iU7i-EW4SoEVV3JxeUIYAqHhpf9-8?usp=sharing\"><span class=\"s1\">this Colab<\/span><\/a> and contribute your own custom JavaScript snippets at <a href=\"https:\/\/github.com\/ipullrank\/SFSS-Custom-Extractions\"><span class=\"s1\">this GitHub<\/span><\/a>. 
I\u2019ll be back soon with some classification use cases.<\/p><p class=\"p1\">Let me know if there\u2019s anything you want me to cook up for you.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b1b263 elementor-widget elementor-widget-template\" data-id=\"4b1b263\" data-element_type=\"widget\" data-widget_type=\"template.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-template\">\n\t\t\t\t\t<div data-elementor-type=\"section\" data-elementor-id=\"17351\" class=\"elementor elementor-17351\" data-elementor-post-type=\"elementor_library\">\n\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-51a09b09 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"51a09b09\" data-element_type=\"section\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-2238df9f\" data-id=\"2238df9f\" data-element_type=\"column\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7fba21b9 elementor-widget elementor-widget-heading\" data-id=\"7fba21b9\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Next Steps<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d80b72f elementor-widget elementor-widget-text-editor\" data-id=\"d80b72f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><span style=\"font-weight: 400;\">Here are three ways iPullRank can help you combine SEO and content to earn visibility for your business and drive revenue:<\/span><\/p><ol><li style=\"font-weight: 400;\" aria-level=\"1\"><b>Schedule a 30-Minute Strategy Session: <\/b><span style=\"font-weight: 400;\">Share your biggest SEO and content challenges so we can put together a custom discovery deck after looking through your digital presence. No one-size-fits-all solutions, only tailored advice to grow your business.<\/span><a href=\"https:\/\/ipullrank.com\/contact\"><span style=\"font-weight: 400;\"> Schedule your consultation session now<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li><li aria-level=\"1\"><strong>Get Our Newsletter:<\/strong> AI is reshaping search. The Rank Report gives you signal through the noise, so your brand doesn\u2019t just keep up, it leads. <a href=\"https:\/\/ipullrank.com\/rank-report\">Subscribe to the Rank Report.<\/a><\/li><li style=\"font-weight: 400;\" aria-level=\"1\"><b>Enhance Your Content&#8217;s Relevance with Relevance Doctor:<\/b><span style=\"font-weight: 400;\"> Not sure if your content is mathematically relevant? Use Relevance Doctor to test and improve your content&#8217;s relevancy, ensuring it ranks for your targeted keywords.<\/span><a href=\"https:\/\/ipullrank.com\/tools\/relevance-doctor\"><span style=\"font-weight: 400;\"> Test your content relevance today<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li><\/ol><p><span style=\"font-weight: 400;\">Want more? 
Visit <\/span><a href=\"https:\/\/ipullrank.com\/blog\">our blog<\/a> <span style=\"font-weight: 400;\">for access to past webinars, exclusive guides, and insightful blogs crafted by our team of experts.\u00a0<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Since learning of the importance and magnitude of vector embeddings, I have been proposing that a link index should vectorize the web and make those representations of pages available to SEOs. Fundamentally, with the further integration of machine learning in Google\u2019s ranking systems, vector embeddings are even more important to what we do than an [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":18446,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260,26,33],"tags":[263,240],"diagnosis-deliverable":[],"class_list":["post-17504","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-relevance-engineering","category-seo","category-tools","tag-mike-king-best-2","tag-popular-article"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>SEO Use Cases for Vectorizing the Web with Screaming Frog<\/title>\n<meta name=\"description\" content=\"Learn the SEO use cases when you leverage Screaming Frog&#039;s new feature to run bespoke JS functions and generate vector embeddings from OpenAI.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"SEO Use Cases for Vectorizing the Web with Screaming Frog\" \/>\n<meta property=\"og:description\" content=\"Learn the SEO use cases when you leverage Screaming Frog&#039;s new feature to run bespoke JS functions and generate vector embeddings from OpenAI.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\" \/>\n<meta property=\"og:site_name\" content=\"iPullRank\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-08T21:41:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-31T19:51:35+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/284.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Mike King\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/284.jpg\" \/>\n<meta name=\"twitter:creator\" content=\"@ipullrankagency\" \/>\n<meta name=\"twitter:site\" content=\"@ipullrankagency\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mike King\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"23 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#article\",\"isPartOf\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\"},\"author\":{\"name\":\"Mike King\",\"@id\":\"https:\/\/ipullrank.com\/#\/schema\/person\/82831a4b9f4b8be81d5a9bfed4cb9b20\"},\"headline\":\"Vector Embeddings is All You Need: SEO Use Cases for Vectorizing the Web with Screaming Frog\",\"datePublished\":\"2024-05-08T21:41:21+00:00\",\"dateModified\":\"2025-07-31T19:51:35+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\"},\"wordCount\":4325,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/ipullrank.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png\",\"keywords\":[\"mike king best 2\",\"Popular article\"],\"articleSection\":[\"Relevance Engineering\",\"SEO\",\"Tools\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\",\"url\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\",\"name\":\"SEO Use Cases for Vectorizing the Web with Screaming 
Frog\",\"isPartOf\":{\"@id\":\"https:\/\/ipullrank.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage\"},\"image\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png\",\"datePublished\":\"2024-05-08T21:41:21+00:00\",\"dateModified\":\"2025-07-31T19:51:35+00:00\",\"description\":\"Learn the SEO use cases when you leverage Screaming Frog's new feature to run bespoke JS functions and generate vector embeddings from OpenAI.\",\"breadcrumb\":{\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage\",\"url\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png\",\"contentUrl\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png\",\"width\":699,\"height\":400},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ipullrank.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Vector Embeddings is All You Need: SEO Use Cases for Vectorizing the Web with Screaming Frog\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ipullrank.com\/#website\",\"url\":\"https:\/\/ipullrank.com\/\",\"name\":\"iPullRank\",\"description\":\"Digital Marketing Agency in 
NYC\",\"publisher\":{\"@id\":\"https:\/\/ipullrank.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ipullrank.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/ipullrank.com\/#organization\",\"name\":\"iPullRank\",\"url\":\"https:\/\/ipullrank.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ipullrank.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2025\/03\/Logo_-_Layers.svg\",\"contentUrl\":\"https:\/\/ipullrank.com\/wp-content\/uploads\/2025\/03\/Logo_-_Layers.svg\",\"width\":177,\"height\":36,\"caption\":\"iPullRank\"},\"image\":{\"@id\":\"https:\/\/ipullrank.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/ipullrankagency\",\"https:\/\/www.linkedin.com\/company\/ipullrank\/\",\"https:\/\/www.youtube.com\/@iPullRankSEO\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/ipullrank.com\/#\/schema\/person\/82831a4b9f4b8be81d5a9bfed4cb9b20\",\"name\":\"Mike King\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ipullrank.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/d57e62b40de6db99771f85cbce3ab1d29071b8cd0d643c8dcf2fc55818e1769f?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/d57e62b40de6db99771f85cbce3ab1d29071b8cd0d643c8dcf2fc55818e1769f?s=96&d=mm&r=g\",\"caption\":\"Mike King\"},\"description\":\"Mike King is the Founder and CEO of iPullRank. Deeply technical and highly creative, Mike has helped generate over $4B in revenue for his clients. 
A rapper and recovering big agency guy, Mike's greatest clients are his two daughters: Zora and Glory.\",\"url\":\"https:\/\/ipullrank.com\/author\/ipullrank\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"SEO Use Cases for Vectorizing the Web with Screaming Frog","description":"Learn the SEO use cases when you leverage Screaming Frog's new feature to run bespoke JS functions and generate vector embeddings from OpenAI.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need","og_locale":"en_US","og_type":"article","og_title":"SEO Use Cases for Vectorizing the Web with Screaming Frog","og_description":"Learn the SEO use cases when you leverage Screaming Frog's new feature to run bespoke JS functions and generate vector embeddings from OpenAI.","og_url":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need","og_site_name":"iPullRank","article_published_time":"2024-05-08T21:41:21+00:00","article_modified_time":"2025-07-31T19:51:35+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/284.jpg","type":"image\/jpeg"}],"author":"Mike King","twitter_card":"summary_large_image","twitter_image":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/284.jpg","twitter_creator":"@ipullrankagency","twitter_site":"@ipullrankagency","twitter_misc":{"Written by":"Mike King","Est. 
reading time":"23 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#article","isPartOf":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need"},"author":{"name":"Mike King","@id":"https:\/\/ipullrank.com\/#\/schema\/person\/82831a4b9f4b8be81d5a9bfed4cb9b20"},"headline":"Vector Embeddings is All You Need: SEO Use Cases for Vectorizing the Web with Screaming Frog","datePublished":"2024-05-08T21:41:21+00:00","dateModified":"2025-07-31T19:51:35+00:00","mainEntityOfPage":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need"},"wordCount":4325,"commentCount":0,"publisher":{"@id":"https:\/\/ipullrank.com\/#organization"},"image":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage"},"thumbnailUrl":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png","keywords":["mike king best 2","Popular article"],"articleSection":["Relevance Engineering","SEO","Tools"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need","url":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need","name":"SEO Use Cases for Vectorizing the Web with Screaming Frog","isPartOf":{"@id":"https:\/\/ipullrank.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage"},"image":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage"},"thumbnailUrl":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png","datePublished":"2024-05-08T21:41:21+00:00","dateModified":"2025-07-31T19:51:35+00:00","description":"Learn the SEO use cases when you leverage Screaming Frog's new feature to run bespoke JS functions and generate vector embeddings 
from OpenAI.","breadcrumb":{"@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#primaryimage","url":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png","contentUrl":"https:\/\/ipullrank.com\/wp-content\/uploads\/2024\/05\/Frame-1597879902.png","width":699,"height":400},{"@type":"BreadcrumbList","@id":"https:\/\/ipullrank.com\/vector-embeddings-is-all-you-need#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ipullrank.com\/"},{"@type":"ListItem","position":2,"name":"Vector Embeddings is All You Need: SEO Use Cases for Vectorizing the Web with Screaming Frog"}]},{"@type":"WebSite","@id":"https:\/\/ipullrank.com\/#website","url":"https:\/\/ipullrank.com\/","name":"iPullRank","description":"Digital Marketing Agency in 
NYC","publisher":{"@id":"https:\/\/ipullrank.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ipullrank.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ipullrank.com\/#organization","name":"iPullRank","url":"https:\/\/ipullrank.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ipullrank.com\/#\/schema\/logo\/image\/","url":"https:\/\/ipullrank.com\/wp-content\/uploads\/2025\/03\/Logo_-_Layers.svg","contentUrl":"https:\/\/ipullrank.com\/wp-content\/uploads\/2025\/03\/Logo_-_Layers.svg","width":177,"height":36,"caption":"iPullRank"},"image":{"@id":"https:\/\/ipullrank.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/ipullrankagency","https:\/\/www.linkedin.com\/company\/ipullrank\/","https:\/\/www.youtube.com\/@iPullRankSEO"]},{"@type":"Person","@id":"https:\/\/ipullrank.com\/#\/schema\/person\/82831a4b9f4b8be81d5a9bfed4cb9b20","name":"Mike King","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ipullrank.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/d57e62b40de6db99771f85cbce3ab1d29071b8cd0d643c8dcf2fc55818e1769f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d57e62b40de6db99771f85cbce3ab1d29071b8cd0d643c8dcf2fc55818e1769f?s=96&d=mm&r=g","caption":"Mike King"},"description":"Mike King is the Founder and CEO of iPullRank. Deeply technical and highly creative, Mike has helped generate over $4B in revenue for his clients. 
A rapper and recovering big agency guy, Mike's greatest clients are his two daughters: Zora and Glory.","url":"https:\/\/ipullrank.com\/author\/ipullrank"}]}},"_links":{"self":[{"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/posts\/17504","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/comments?post=17504"}],"version-history":[{"count":0,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/posts\/17504\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/media\/18446"}],"wp:attachment":[{"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/media?parent=17504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/categories?post=17504"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/tags?post=17504"},{"taxonomy":"diagnosis-deliverable","embeddable":true,"href":"https:\/\/ipullrank.com\/wp-json\/wp\/v2\/diagnosis-deliverable?post=17504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}