Google Sitemaps Practical Tutorial [Repost]

Google Sitemaps Practical Tutorial, Part 3
Author: coiner | 2005-11-18 23:23:42 (779 views)
Now create your own configuration file from the configuration template:

1. Open example_config.xml in a text editor and save it as a new file (for example, config.xml or mysite_config.xml).
2. Find the site definition section:
<site
base_url="http://www.example.com/"
store_into="/var/www/docroot/sitemap.xml.gz"
verbose="1">
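Replace the base_url value with your site's address. Change the store_into value to the path on your web server where you want the Sitemap saved. As mentioned earlier, this can be the site root or a folder reserved for sitemap files. I store sitemap.xml in the site root (webroot); my configuration is as follows: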

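If the file name ends in .gz, the script compresses the map file automatically when it runs. You can also use a plain .xml name, but if the site has many pages a compressed file is recommended because it is easier for Google to download. If your site has more than 50,000 URLs, the program automatically splits them across several .gz files and then creates a sitemap_index.xml file as the main map file.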
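The splitting rule above can be sketched in a few lines of Python. This is only an illustration of the 50,000-URL limit, not the generator's own code: the function name `plan_sitemap_files` and the numbered file names it produces are assumptions made for this example.

```python
MAX_URLS_PER_FILE = 50000  # the per-file limit mentioned in the text


def plan_sitemap_files(urls, base_name="sitemap", limit=MAX_URLS_PER_FILE):
    """Return a list of (file_name, url_chunk) pairs.

    The numbered file names are hypothetical; the real script may
    name its output files differently.
    """
    urls = list(urls)
    if len(urls) <= limit:
        # Everything fits in one compressed file.
        return [(f"{base_name}.xml.gz", urls)]
    plan = []
    for start in range(0, len(urls), limit):
        number = start // limit + 1
        plan.append((f"{base_name}{number}.xml.gz", urls[start:start + limit]))
    return plan
```

With 120,000 URLs this plan yields three files holding 50,000, 50,000, and 20,000 URLs; an index file would then list those three.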

3. Configure the parameters in each section:
a) Find the following section:

<!-- ** MODIFY or DELETE **
"url" nodes specify individual URLs to include in the map.

Required attributes:
href - the URL

Optional attributes:
lastmod - timestamp of last modification (ISO8601 format)
changefreq - how often content at this URL is usually updated
priority - value 0.0 to 1.0 of relative importance in your site
-->

<url href="http://www.example.com/stats?q=name" />
<url href="http://www.example.com/stats?q=age"
lastmod="2004-11-14T01:00:00-07:00"
changefreq="yearly"
priority="0.3"
/>
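This section gives two examples. The first includes only the required attribute, href, which you can set directly to your site's address; if you only want a map of your BBS, set it to the BBS's web path. The second includes both the required and the optional attributes:
lastmod: the time the page at this URL was last modified. It must be written in the standard (ISO 8601) format.
changefreq: gives Google a rough idea of how often the URL is updated. Google treats this value as a hint, not as a fixed schedule for fetching your map file.
priority: tells Google how important this page is relative to the other pages on your site. It has no effect on how Google compares your pages with pages on other sites; it only helps Google understand which of your pages you consider most important. In other words, if you build map files for your site directory by directory, you can give each directory a value from 0.0 to 1.0 according to its importance. My configuration is as follows: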
<!-- updated weekly; "daily" is also a valid value -->
<url href="http://www.bbar.cn"
lastmod="2005-11-01T01:00:00-07:00"
changefreq="weekly"
priority="0.3"
/>
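Hand-typed lastmod and changefreq values are easy to get wrong, so a small helper can produce and check them. The sketch below uses only the Python standard library; the helper names `format_lastmod` and `check_url_attributes` are made up for this example and are not part of the generator script.

```python
from datetime import datetime, timedelta, timezone

# changefreq values defined by the Sitemap protocol
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}


def format_lastmod(dt):
    """Format a timezone-aware datetime as ISO 8601, e.g. 2005-11-01T01:00:00-07:00."""
    if dt.tzinfo is None:
        raise ValueError("lastmod needs a timezone-aware datetime")
    return dt.replace(microsecond=0).isoformat()


def check_url_attributes(changefreq, priority):
    """Raise ValueError if changefreq or priority is out of range."""
    if changefreq not in VALID_CHANGEFREQ:
        raise ValueError(f"unknown changefreq: {changefreq!r}")
    if not 0.0 <= priority <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {priority}")


# Reproduce the lastmod value used in the configuration above (UTC-7).
mountain = timezone(timedelta(hours=-7))
stamp = format_lastmod(datetime(2005, 11, 1, 1, 0, 0, tzinfo=mountain))
check_url_attributes("weekly", 0.3)
```

A misspelling such as "dayly" would be rejected by `check_url_attributes` before it reaches the configuration file.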

b) Find the following section:

<!-- ** MODIFY or DELETE **
"urllist" nodes name text files with lists of URLs.
An example file "example_urllist.txt" is provided.

Required attributes:
path - path to the file

Optional attributes:
encoding - encoding of the file if not US-ASCII
-->
<urllist path="example_urllist.txt" encoding="UTF-8" />
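Use this format to point to the path and name of a text file containing your list of URLs. You can use the supplied example_urllist.txt file as a template for this text file. You need to give the full path on the web server. If you created the text file in an encoding other than UTF-8, use the encoding attribute to specify that encoding. If you have several .txt files, you can use a wildcard. For example: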

<urllist path="example_urllist*.txt" encoding="UTF-8" />
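For each URL in the text file you can specify a last-modified date, a change frequency, and a priority. See the "URLlist text file reference" section for complete information on the structure of this file.
This section tells us that a map can be built from a URL list prepared in advance; some map-making tools can produce such a list. Since I did not use one, I simply deleted this section.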

c) Find the following section:

<!-- ** MODIFY or DELETE **
"directory" nodes tell the script to walk the file system and
include all files and directories in the Sitemap.

Required attributes:
path - path to begin walking from
url - URL equivalent of that path

Optional attributes:
default_file - name of the index or default file for directory URLs

-->
<directory path="/var/www/icons" url="http://www.example.com/images/" />
<directory path="/var/www/docroot" url="http://www.example.com/" default_file="index.html" />
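This section gives two examples. If all of your pages live in subdirectories of a single path, you only need one entry. If several paths lead to pages on your site, provide one entry for each. Remember that every URL must begin with the base URL you specified in step 2. For example, both entries in example_config.xml contain URLs beginning with http://www.example.com/, so both are valid.
Replace the example entries with entries for your own site. Many sites have just one entry pointing to the base URL. Make sure the path value is the full path of the directory on the web server. Make sure the url value is a complete URL, including the protocol (for example, http) where needed and ending with a slash.
You can use the default_file parameter to name the file that the server uses as the default page for a directory. In the example above, /var/www/docroot resolves to http://www.example.com/index.html. The parameter is optional. If you do specify it, the Sitemaps generator lists each subdirectory's default page only once (rather than listing both the directory URL and the file-name URL), and it takes that page's lastmod attribute from the last-modified date of the file rather than of the directory. My configuration is as follows: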
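To make the relationship between path, url, and the base URL concrete, here is a minimal Python sketch of how a file found under a directory entry could be turned into a URL. It is not taken from the generator script: the function name `path_to_url` is hypothetical, and the sketch assumes POSIX-style paths and ignores URL escaping.

```python
import posixpath


def path_to_url(file_path, dir_path, dir_url, base_url):
    """Map a file under dir_path to the matching URL under dir_url."""
    # Every URL must start with the site's base URL.
    if not dir_url.startswith(base_url):
        raise ValueError("directory url must start with base_url")
    relative = posixpath.relpath(file_path, dir_path)
    if relative == "..":
        raise ValueError("file is not inside the directory")
    if relative.startswith("../"):
        raise ValueError("file is not inside the directory")
    if relative == ".":
        # The directory itself maps to the directory URL.
        return dir_url
    return dir_url.rstrip("/") + "/" + relative


icon_url = path_to_url(
    "/var/www/icons/small/folder.png",   # file found while walking the tree
    "/var/www/icons",                    # path attribute
    "http://www.example.com/images/",    # url attribute
    "http://www.example.com/",           # base_url from the site element
)
```

Here `icon_url` comes out as http://www.example.com/images/small/folder.png, which begins with the base URL and is therefore acceptable.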

d) Find the following section:

<!-- ** MODIFY or DELETE **
"accesslog" nodes tell the script to scan webserver log files to
extract URLs on your site. Both Common Logfile Format (Apache's default
logfile) and Extended Logfile Format (IIS's default logfile) can be read.

Required attributes:
path - path to the file
Optional attributes:
encoding - encoding of the file if not US-ASCII
-->
<accesslog path="/etc/httpd/logs/access.log" encoding="UTF-8" />
<accesslog path="/etc/httpd/logs/access.log.0" encoding="UTF-8" />
<accesslog path="/etc/httpd/logs/access.log.1.gz" encoding="UTF-8" />
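This section gives three examples. Replace them with one entry for each of your log files. Make sure the path value is the full path and file name on the web server. If a log file is not encoded in US-ASCII or UTF-8, use the optional encoding attribute to specify its encoding. You do not have to list every log file; you can use a wildcard. For instance, the three entries above could be replaced by the following single entry, which covers all three log files: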

<accesslog path="/etc/httpd/logs/access.log*" encoding="UTF-8" />
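The Sitemaps generator assigns a priority to each URL found in the logs according to how often it was accessed. For example, a URL accessed 100 times gets a higher priority than one accessed twice. The actual priority is relative: it depends on how each URL compares with the other URLs on the site.
This section may be hard to follow at first, but it is a very useful feature. It tells us that the map-generation script can read the access logs recorded by your web server, work out how often each page is visited, and use that to set the relative priority of the URLs in the generated map. To set this up, first create a logs directory under the site path, then configure the logging properties of your site in IIS; if you use another web server, use its equivalent settings. The IIS configuration is shown in the figure below: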
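The text only says that priorities derived from logs are relative, so the exact formula is not known here. As one possible illustration, the Python sketch below counts hits per path and scales each count by the busiest path; both the function name `priorities_from_hits` and the scale-by-maximum rule are assumptions, not the generator's documented behaviour.

```python
from collections import Counter


def priorities_from_hits(requested_paths):
    """Return {path: priority}, with the most requested path at 1.0.

    Scaling by the maximum hit count is an assumption made for this
    sketch; the real script may weight hits differently.
    """
    hits = Counter(requested_paths)
    if not hits:
        return {}
    busiest = max(hits.values())
    return {path: round(count / busiest, 2) for path, count in hits.items()}


# 100 requests for one page and 2 for another, as in the example above.
sample = ["/index.html"] * 100 + ["/about.html"] * 2
priorities = priorities_from_hits(sample)
```

Whatever the real formula is, the ordering is the point: the page requested 100 times ends up with a higher priority than the page requested twice.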

e) Find the following section:

<!-- ** MODIFY or DELETE **
"sitemap" nodes tell the script to scan other Sitemap files. This can
be useful to aggregate the results of multiple runs of this script into
a single Sitemap.

Required attributes:
path - path to the file
-->
<sitemap path="/var/www/docroot/subpath/sitemap.xml" />
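This section gives one example. Replace it with one entry for each Sitemap you want to include. Make sure the path value is the full path and file name on the web server. You can list gzip-compressed Sitemaps as long as their extension is .gz. You do not have to list every Sitemap; you can use a wildcard. For example, the following entry includes every Sitemap whose name begins with "sitemap" and whose extension is .xml.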

<sitemap path="/var/www/docroot/subpath/sitemap*.xml" />
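The Sitemaps generator extracts every URL, together with the optional data listed for each URL, from all the Sitemaps you name, and uses this information to create a single Sitemap file. At present there is no guarantee that this works for Sitemaps created by tools other than the Sitemaps generator.
In short, these parameters are for sites that have built separate maps for several directories. My test covered only a whole-site map, with no per-directory maps, so I deleted this section.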

f) Find the filter definition section:

<!-- ********************************************************
FILTERS

Filters specify wild-card patterns that the script compares
against all URLs it finds. Filters can be used to exclude
certain URLs from your Sitemap, for instance if you have
hidden content that you hope the search engines don't find.

Filters can be either type="wildcard", which means standard
path wildcards (* and ?) are used to compare against URLs,
or type="regexp", which means regular expressions are used
to compare.

Filters are applied in the order specified in this file.
An action="drop" filter causes exclusion of matching URLs.
An action="pass" filter causes inclusion of matching URLs,
shortcutting any other later filters that might also match.
If no filter at all matches a URL, the URL will be included.
Together you can build up fairly complex rules.

The default action is "drop".
The default type is "wildcard".

You can MODIFY or DELETE these entries as appropriate for
your site. However, unlike above, the example entries in
this section are not contrived and may be useful to you as
they are.
********************************************************* -->

<filter action="drop" type="wildcard" pattern="*~" />

<filter action="drop" type="regexp" pattern="/\.[^/]*" />
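You can use filters to exclude particular URLs from the generated Sitemap. You might do this to produce a more concise list, to cut down on duplicate entries, or to keep particular URLs out of the index. Note that if you use a robots.txt file to keep URLs out of the index, Google will not crawl or index them even if they appear in the Sitemap. You can use any or all of the filtering methods. Delete the entries you do not need and create others as required. Usage examples follow.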

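<filter action="drop" type="wildcard" pattern="*.jpg" />
This filter excludes URLs ending in .jpg. You may want a filter like this if all of the site's images are embedded in HTML pages and should not be accessed as standalone URLs.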

<filter action="pass" type="wildcard" pattern="*.htm*" />
<filter action="drop" type="wildcard" pattern="*" />
This pair of filters accepts all .htm* files and excludes everything else.
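The ordering rules stated in the FILTERS comment (filters applied in file order, "pass" short-circuits later filters, unmatched URLs are included) can be sketched in Python as follows. This is an illustration rather than the script's own code: the function name `url_is_included` and the tuple layout are invented for the example, and it assumes that the first matching filter decides and that regexp patterns may match anywhere in the URL.

```python
import fnmatch
import re


def url_is_included(url, filters):
    """Apply (action, type, pattern) filters in order; the first match decides."""
    for action, kind, pattern in filters:
        if kind == "wildcard":
            # Standard path wildcards: * and ?
            matched = fnmatch.fnmatchcase(url, pattern)
        else:
            # type="regexp"; searching anywhere in the URL is an assumption
            matched = re.search(pattern, url) is not None
        if matched:
            return action == "pass"
    # No filter matched, so the URL is included.
    return True


html_only = [
    ("pass", "wildcard", "*.htm*"),
    ("drop", "wildcard", "*"),
]
keeps_page = url_is_included("http://www.example.com/index.html", html_only)
keeps_image = url_is_included("http://www.example.com/logo.jpg", html_only)
```

With the two-filter list above, the HTML page is kept and the image is dropped. Reversing the two entries would drop everything, which is why the order in the configuration file matters.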

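The purpose of the filter section is straightforward: if there are pages or directories you do not want indexed, you can exclude them by file extension or by directory. Filters can also be combined with robots.txt. For example, if you do not want your admin path indexed, declare it in robots.txt. For details on using robots.txt, see this article: http://www.googlepub.com/html/200511/244.html. My configuration is as follows: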
