<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/wordpress-mu-1.0" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Do not crawl list?</title>
	<link>http://blogs.adventnet.com/raju/2005/09/07/do-not-crawl-list/</link>
	<description>Just another Blogs.adventnet.com weblog</description>
	<pubDate>Sat, 21 Nov 2009 08:51:58 +0000</pubDate>
	<generator>http://wordpress.org/?v=wordpress-mu-1.0</generator>

	<item>
		<title>by: raju</title>
		<link>http://blogs.adventnet.com/raju/2005/09/07/do-not-crawl-list/#comment-8</link>
		<pubDate>Tue, 23 Jan 2007 14:08:56 +0000</pubDate>
		<guid>http://blogs.adventnet.com/raju/2005/09/07/do-not-crawl-list/#comment-8</guid>
					<description>Badri:

robots.txt specifies which pages not to crawl. I don't think it defines which content not to crawl. Most of the content might be crawlable, but there could be some sensitive information that the user does not want search engines to have easy access to.

Also, robots.txt has to be defined for each website. Let's say I list your phone number on a discussion forum on some random website: you won't have the option to change that website's robots.txt. While you can ask the webmaster to remove it, the data remains in the search engines' databases forever. You'll keep getting spam calls. With the approach I mentioned, you would just go to a single central website and say 'Do not let any search engine store my phone number'.

The concept is similar to the Do Not Call Registry (https://www.donotcall.gov/), but applied to search engines.

Raju</description>
		<content:encoded><![CDATA[<p>Badri:</p>
<p>robots.txt specifies which pages not to crawl. I don&#8217;t think it defines which content not to crawl. Most of the content might be crawlable, but there could be some sensitive information that the user does not want search engines to have easy access to.</p>
<p>Also, robots.txt has to be defined for each website. Let&#8217;s say I list your phone number on a discussion forum on some random website: you won&#8217;t have the option to change that website&#8217;s robots.txt. While you can ask the webmaster to remove it, the data remains in the search engines&#8217; databases forever. You&#8217;ll keep getting spam calls. With the approach I mentioned, you would just go to a single central website and say &#8216;Do not let any search engine store my phone number&#8217;.</p>
<p>The concept is similar to the Do Not Call Registry (https://www.donotcall.gov/), but applied to search engines.</p>
<p>Raju
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: badrinath</title>
		<link>http://blogs.adventnet.com/raju/2005/09/07/do-not-crawl-list/#comment-7</link>
		<pubDate>Tue, 23 Jan 2007 14:08:05 +0000</pubDate>
		<guid>http://blogs.adventnet.com/raju/2005/09/07/do-not-crawl-list/#comment-7</guid>
					<description>Why maintain a central list when a robots.txt file maintained on each website can tell search engines what to crawl and what not to?</description>
		<content:encoded><![CDATA[<p>Why maintain a central list when a robots.txt file maintained on each website can tell search engines what to crawl and what not to?
</p>
]]></content:encoded>
				</item>
</channel>
</rss>
