
Detecting crawlers




How to detect search engine crawlers?


Today I was looking for a way to detect when a client is a search engine crawler. You could build a fancy solution for this, but the .NET Framework already has one: the Request.Browser.Crawler property. Out of the box, however, this property always returns false, even when the site is visited by a search engine crawler, because no crawlers are configured in a default installation of .NET.


    In VB.NET:
    
    ''' <summary>
    ''' Check if request is a crawler
    ''' </summary>
    ''' <returns>true if crawler hit</returns>
    Public Shared Function isCrawler() As Boolean
        ' Note: trace and className come from the containing class (not shown here).
        Dim clientBrowserCaps As System.Web.HttpBrowserCapabilities = System.Web.HttpContext.Current.Request.Browser

        ' Depends on configuration in web.config browserCaps section:
        If DirectCast(clientBrowserCaps, System.Web.Configuration.HttpCapabilitiesBase).Crawler Then
            trace.Write("isCrawler", "Called from crawler: " & System.Web.HttpContext.Current.Request.Browser.ToString, className)
            Return True
        Else
            ' Fall back to inspecting the raw user agent string.
            Dim agent As String = System.Web.HttpContext.Current.Request.ServerVariables("HTTP_USER_AGENT")

            If String.IsNullOrEmpty(agent) Then
                ' Some clients send no User-Agent header at all.
                Return False
            ElseIf agent.Contains("Googlebot") Then
                trace.Write("isCrawler", "Google bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("msnbot") Then
                trace.Write("isCrawler", "MSN bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("Yahoo") OrElse agent.Contains("Slurp") Then
                trace.Write("isCrawler", "Yahoo bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("Mercator") Then
                trace.Write("isCrawler", "Altavista bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("Baiduspider") Then
                trace.Write("isCrawler", "Baidu bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("ArchitextSpider") Then
                trace.Write("isCrawler", "Excite bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("Lycos_Spider") Then
                trace.Write("isCrawler", "Lycos bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains("Ask Jeeves") Then
                trace.Write("isCrawler", "Ask Jeeves bot detected: " & agent, className)
                Return True
            ElseIf agent.Contains(".ibm.com") Then
                trace.Write("isCrawler", "IBM bot detected: " & agent, className)
                Return True
            End If
        End If

        Return False
    End Function
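
The user agent fallback in the function above could also be kept as a list, so that adding a crawler means adding one string. The following is only a sketch of that idea: the module name CrawlerFragments and the function IsCrawlerAgent are made up for illustration, and the fragments are simply the ones already used above. It has no dependency on System.Web, so it can be tried in a console project.

```vbnet
Imports System

' Hypothetical helper, not part of ASP.NET: the names are illustrative only.
Public Module CrawlerFragments

    ' User agent fragments copied from the isCrawler function above.
    Private ReadOnly Fragments As String() = New String() { _
        "Googlebot", "msnbot", "Yahoo", "Slurp", "Mercator", _
        "Baiduspider", "ArchitextSpider", "Lycos_Spider", "Ask Jeeves", ".ibm.com"}

    ' Returns True when the user agent contains one of the known fragments.
    ' The comparison is ordinal and case-sensitive, like String.Contains.
    Public Function IsCrawlerAgent(ByVal userAgent As String) As Boolean
        If String.IsNullOrEmpty(userAgent) Then
            Return False
        End If

        For Each fragment As String In Fragments
            If userAgent.IndexOf(fragment, StringComparison.Ordinal) >= 0 Then
                Return True
            End If
        Next

        Return False
    End Function

    Public Sub Main()
        Console.WriteLine(IsCrawlerAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
        Console.WriteLine(IsCrawlerAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
    End Sub

End Module
```

Inside isCrawler, the ElseIf chain would then shrink to a single call with the value of HTTP_USER_AGENT, at the cost of losing the per-crawler trace messages.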

ASP.NET uses the <browserCaps> section in machine.config or web.config to determine whether the client browser is a crawler. In the default installation the crawler filter information is blank, which is why you always get false. To fix this, add search engine crawler filters to a <browserCaps> section in your web.config, like this:

    <configuration>
      <system.web>
        <browserCaps>
          <filter>
            <!-- Google Crawler -->
            <case match="Googlebot">
              browser=Googlebot
              crawler=true
            </case>

            <!-- Yahoo Crawler -->
            <case match="http\:\/\/help.yahoo.com\/help\/us\/ysearch\/slurp">
              browser=YahooCrawler
              crawler=true
            </case>

            <!-- MSN Crawler -->
            <case match="msnbot">
              browser=msnbot
              crawler=true
            </case>

            <!-- check Alta Vista (Mercator) -->
            <case match="Mercator">
              browser=AltaVista
              crawler=true
            </case>

            <!-- check Slurp (Yahoo uses this as well) -->
            <case match="Slurp">
              browser=Slurp
              crawler=true
            </case>

            <!-- Baidu Crawler -->
            <case match="Baiduspider">
              browser=Baiduspider
              crawler=true
            </case>

            <!-- check Excite -->
            <case match="ArchitextSpider">
              browser=Excite
              crawler=true
            </case>

            <!-- Lycos -->
            <case match="Lycos_Spider">
              browser=Lycos
              crawler=true
            </case>

            <!-- Ask Jeeves -->
            <case match="Ask Jeeves">
              browser=AskJeeves
              crawler=true
            </case>

            <!-- IBM Research Web Crawler -->
            <case match="http\:\/\/www\.almaden.ibm.com\/cs\/crawler">
              browser=IBMResearchWebCrawler
              crawler=true
            </case>
          </filter>
        </browserCaps>
      </system.web>
    </configuration>


Tip: you can find more crawler info in your IIS logs ([Windows Folder]\system32\LogFiles)
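
To pull the user agents out of such a log, something along these lines might help. This is a rough sketch that assumes the logs use the W3C extended format with the cs(User-Agent) field enabled; the module name LogUserAgents and the function CountUserAgents are invented for this example, and the log file path is passed in rather than guessed.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

' Hypothetical helper: counts how often each user agent appears in a log.
Public Module LogUserAgents

    Public Function CountUserAgents(ByVal lines As IEnumerable(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)()
        Dim agentIndex As Integer = -1

        For Each line As String In lines
            If line.StartsWith("#Fields:") Then
                ' The #Fields directive names the columns of the entries that follow.
                Dim names As String() = line.Substring("#Fields:".Length).Trim().Split(" "c)
                agentIndex = Array.IndexOf(names, "cs(User-Agent)")
            ElseIf line.Length > 0 AndAlso Not line.StartsWith("#") AndAlso agentIndex >= 0 Then
                ' Entries are space-separated; skip lines that are too short.
                Dim fields As String() = line.Split(" "c)
                If agentIndex < fields.Length Then
                    Dim agent As String = fields(agentIndex)
                    Dim current As Integer = 0
                    counts.TryGetValue(agent, current)
                    counts(agent) = current + 1
                End If
            End If
        Next

        Return counts
    End Function

    Public Sub Main(ByVal args As String())
        ' Expects the path of one log file as the only argument.
        If args.Length = 0 Then
            Console.WriteLine("Usage: LogUserAgents <path-to-log-file>")
            Return
        End If

        For Each entry As KeyValuePair(Of String, Integer) In CountUserAgents(File.ReadAllLines(args(0)))
            Console.WriteLine("{0,6}  {1}", entry.Value, entry.Key)
        Next
    End Sub

End Module
```

User agents that show up often and do not look like a regular browser are candidates for a new <case> entry in the filter.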

     