    From Word to XML
    By John E. Simpson
    Among the most-asked XML questions of all are those which ask how to process XML using a client application with which the questioner is already familiar. The bulk of these questions, in turn, focus on XML's virtues as an open, structured-data medium: "How do I use XML in a database?" for instance, or "How can I convert my XML document into an Excel spreadsheet (or vice-versa)?"

    But, especially given its roots in SGML and HTML, XML functions equally well as an open, structured-document medium. And that's where this month's question comes from.

    Note: I don't pretend that my answer here is definitive or encyclopedic. It covers only one solution among a host of alternatives. If the response to past columns of this sort is any indication, within a week or two you'll be able to find numerous reader-supplied comments at the end of the article, giving you pointers to other options.

    Q: How can I convert a Microsoft Word document into XML?
    A: Recent versions of Word claim "save as XML" features of one kind or another. Maybe that "claim" is too harsh; they do create well-formed XML documents, after all. But it's XML of a spectacularly hideous form, even for simple documents -- nearly as gnarly and impenetrable to the human eye as XSL-FO.

    (For a good idea of what to expect, see A. Russell Jones's recent article on devx.com, "[URL=http://www.devx.com/dotnet/Article/17358?trk=DXRSS_XML]Export Customized XML from Microsoft Word with VB.NET[/URL]." Don't worry if you don't know or care anything about VB.NET; just check out that article's Figure 1 -- which shows how the document appears in Word -- and its Listing 1 as well. The latter is the output of the document coming from Word 2003's "save as XML" feature.)

    Whether you like or don't like Word, or use it in your everyday working life, you may be called upon to convert a Word document to XML at some point. And if you don't even have Word in the first place, the quality of the word processor's "save as XML" output is moot anyway. What do you do then?

    A good place to start searching when you're pretty sure software for processing XML must exist, but you don't know where to find it, is xmlsoftware.com. In this case, use the site menu to locate the "[URL=http://www.xmlsoftware.com/convert.html]Conversion Tools[/URL]" page.

    As you can see, most XML-to/from-Word packages don't process "true" Word documents in the classic .doc form. Instead, they rely on Word's long-standing support for Rich Text Format (RTF). (RTF documents are "structured", after a fashion. But the language is intended primarily to support the display of textual matter -- not unlike Adobe's PDF. If you'd like to learn more about RTF, check [URL=http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp]the Microsoft site[/URL]. Another [URL=http://interglacial.com/rtf/]good source[/URL] is the interglacial.com site, put together by Sean M. Burke, author of [URL=http://www.oreilly.com/catalog/rtfpg/]The RTF Pocket Guide[/URL], published in 2003 by O'Reilly and Associates.)

    upCast: Word to RTF to XML
    At least one of the XML conversion tools on the xmlsoftware.com site does support native Word .doc conversion: upCast, from [URL=http://www.infinity-loop.de/]infinity-loop GmbH[/URL]. In this column I'll take a look at how upCast (currently at version 4) does its work.

    First, let's get the questions of platforms and licenses out of the way. upCast is Java-based and thus available cross-platform, with installers for Windows, Unix, and Macs. The licensing comes in a variety of flavors, including (among others) a commercial product, a free evaluation, and a "private" (single user, non-commercial) version.

    After installing upCast and browsing through its documentation (and the infinity-loop site), you find that its .doc file support is limited in one sense: the .doc file(s) in question must have been created using Word 97 (or later), on a PC running Windows 95, 98, NT, or 2000. For other, earlier versions of Word and/or Windows, the document must first be saved as RTF; the RTF file is then fed into the upCast conversion process. This limitation shouldn't be a problem for most Windows users, but it is something to bear in mind.

    The .doc support carries one other requirement: it uses an add-in, provided with upCast, called WordLink. This add-in saves the binary .doc as a temporary RTF file, using a copy of Word which is installed on the user's machine. WordLink isn't available for Mac- and Unix-based upCast users, so users on these platforms are limited to processing RTF files only.
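
    Because the route a file takes depends on whether it is a binary .doc or an RTF file, it can help to check what a file really is before handing it to a converter. The following is only a sketch and is not part of upCast: it relies on two well-known file signatures (RTF documents begin with `{\rtf`, and binary .doc files use the OLE2 compound-file header). The names `sniff_word_format` and `sniff_file`, and the labels they return, are hypothetical and chosen for illustration.

```python
# Hypothetical helper, not part of upCast: classify a file by its leading bytes.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound-file header used by binary .doc
RTF_MAGIC = b"{\\rtf"                            # RTF documents start with this sequence


def sniff_word_format(data: bytes) -> str:
    """Return 'rtf', 'doc', or 'unknown' based on the first bytes of a file."""
    if data.startswith(RTF_MAGIC):
        return "rtf"
    if data.startswith(OLE2_MAGIC):
        return "doc"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(8))
```

    Note that the OLE2 header is shared by other legacy Office formats, so a result of 'doc' here means "possibly a binary Word file" rather than a guarantee.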

    Running upCast is fairly simple. The main dialog box consists of two sections:

    The upper section ("Import Settings") is for specifying input parameters, chief of which is the name of the source file to be converted:
    Figure 1: upCast import settings

    The lower section ("Export Settings") lets you identify the name and properties of the output:
    Figure 2: upCast export settings

    In the second screen shot, I've pulled down the selection list to show what you can do with upCast. By default, the program outputs an XML document using upCast's own built-in DTD. Here's a fragment of a resulting document in this vocabulary:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE document PUBLIC "-//infinity-loop//DTD upCast 4.0//EN"
    "http://www.infinity-loop.de/DTD/upcast/4.0/upcast.dtd">
    <?xml-stylesheet type="text/css" href="helloworld.css"?>
    <document
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:html="http://www.w3.org/HTML/1998/html4"

    xml:lang="en"
    style="widows: 0; orphans: 0; word-break-inside: normal; \-ilx-block-border-mode: merge;">
    <documentinfo>
    <property name="title" value="Hello" type="text" />
    <property name="author" value="John Simpson" type="text" />
    <property name="numberOfPages" value="1" type="integer" />
    </documentinfo>
    <part style="page: pageStyle1;">
    <par class="Normal">Hello world!</par>
    </part>
    </document>

    This has a number of interesting features, notably the xml-stylesheet processing instruction and the two namespace declarations.

    First, note the xml-stylesheet PI. In order to capture not only the contents of the document (which appear later, as text strings within par elements), but also its look-and-feel, upCast extracts style information from the RTF document being processed and writes it to a Cascading Style Sheet. A small fragment of this style sheet looks like this:

    *[class=Normal] {
    display: block;
    /* Paragraph Properties: */
    text-align: left;
    margin-left: 0.0mm;
    /* Character Properties: */
    vertical-align: baseline;
    font-family: "Times New Roman", serif;
    color: #000000;
    font-size: 12.0pt;
    }

    With this style sheet and the PI, a viewer (such as a browser capable of displaying XML via CSS) can render the document's contents in something like the way they appear in the source document. This rendering isn't 100% exact, of course -- CSS doesn't do everything a word processor does, in exactly the same way, and browsers are notoriously inconsistent in the extent to which they support CSS.
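
    To make the relationship between the PI, the style sheet, and the par elements concrete, here is a minimal sketch that reads both from a document in the upCast vocabulary using Python's standard xml.dom.minidom module. It is an illustration only, not something upCast provides: the sample is a trimmed copy of the fragment shown earlier (DOCTYPE and namespace declarations omitted), and the function names `stylesheet_hrefs` and `paragraphs` are assumptions of this example.

```python
import re
from xml.dom import minidom

# Trimmed copy of the upCast output shown earlier (DOCTYPE and namespaces omitted).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="helloworld.css"?>
<document xml:lang="en">
<documentinfo>
<property name="title" value="Hello" type="text" />
<property name="author" value="John Simpson" type="text" />
</documentinfo>
<part style="page: pageStyle1;">
<par class="Normal">Hello world!</par>
</part>
</document>"""


def stylesheet_hrefs(doc):
    """Return the href of every xml-stylesheet PI at the top level of the document."""
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                hrefs.append(match.group(1))
    return hrefs


def paragraphs(doc):
    """Return (class, text) pairs for each par element, in document order."""
    result = []
    for par in doc.getElementsByTagName("par"):
        text = "".join(child.data for child in par.childNodes
                       if child.nodeType == child.TEXT_NODE)
        result.append((par.getAttribute("class"), text))
    return result


doc = minidom.parseString(SAMPLE)
print(stylesheet_hrefs(doc))  # ['helloworld.css']
print(paragraphs(doc))        # [('Normal', 'Hello world!')]
```

    The class value returned for each paragraph ("Normal" here) is what the `*[class=Normal]` selector in the generated style sheet matches against.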

    The second thing to notice about the output document is the two namespace declarations. One declares that the html: namespace prefix is associated with the HTML 4.0 namespace.

    The other (more interesting) one identifies an xlink: namespace prefix. How does upCast use XLink? In several ways, including these:

    • Each hyperlink (including e-mail addresses) in the original Word document is converted to a link element with numerous XLink-specific attributes, such as:

    <par class="Normal"[other attributes]>e-mail:
    <link xlink:type="simple"
    xlink:show="replace"
    xlink:actuate="onRequest"
    xlink:href="mailto:simpson@polaris.net"
    >
    ...
    </link>
    </par>

    • Each Word "bookmark" is translated into a reference element, which (like link) takes a variety of XLink attributes. The xlink:href attribute uses a fragment identifier to locate a specific portion of the document:

    <reference xlink:type="simple" xlink:show="other"
    xlink:actuate="onLoad"
    xlink:href="#theThirdItem"
    ...>3</reference>

    (Note also, by the way, the use of alternative values for the xlink:show and xlink:actuate attributes.)

    • Each image embedded in the Word document is referenced with an empty XLinking image element:

    <image xlink:type="simple" xlink:href="myImage01.jpg"
    xlink:show="embed"
    xlink:actuate="onLoad"/>

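
    For readers without XLink-aware software, the same attributes can be read with any namespace-aware XML parser. The sketch below uses Python's standard xml.etree.ElementTree to list every element that carries an xlink:href. It is an illustration under assumptions, not upCast functionality: the sample document is stitched together from the three snippets above, the text inside the link element is invented, and `collect_xlinks` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # namespace URI declared in the upCast output

# Hypothetical document assembled from the three snippets above; the text
# content of the link element is invented for this example.
SAMPLE = f"""<document xmlns:xlink="{XLINK}">
<par class="Normal">e-mail:
<link xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"
 xlink:href="mailto:simpson@polaris.net">simpson@polaris.net</link>
</par>
<reference xlink:type="simple" xlink:show="other" xlink:actuate="onLoad"
 xlink:href="#theThirdItem">3</reference>
<image xlink:type="simple" xlink:href="myImage01.jpg"
 xlink:show="embed" xlink:actuate="onLoad"/>
</document>"""


def collect_xlinks(root):
    """Return (element name, href, show) for every element carrying xlink:href."""
    href_key = f"{{{XLINK}}}href"  # ElementTree spells qualified names as {uri}local
    show_key = f"{{{XLINK}}}show"
    return [(el.tag, el.get(href_key), el.get(show_key))
            for el in root.iter() if href_key in el.attrib]


links = collect_xlinks(ET.fromstring(SAMPLE))
for name, href, show in links:
    print(name, href, show)
```

    Matching on the namespace URI rather than on the literal xlink: prefix keeps the lookup working even if a document binds that namespace to a different prefix.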
    As I said, actually being able to use such XLinking markup presumes the availability of XLink-smart software. The Mozilla browser can handle simple XLinks in XML documents; for example, the email hyperlink in the first of the above three bullets displays correctly as:

    Figure 3: Mozilla view of upCast link element

    Again, though, you needn't use upCast simply to generate documents in upCast's own XML dialect. As you can see from the second screen shot above, other output options include XHTML 1.0 (Strict) and DocBook 4.2. (DocBook support is only beta-level, although I found no problems with it. One thing it allows you to do is migrate a document from Word to PDF, using software which generates PDF output from DocBook input, without using Adobe Acrobat itself.) As with output to the native upCast vocabulary, selecting either the XHTML or the DocBook output format causes a corresponding CSS style sheet to be generated.

    I did encounter some surprises in the resulting XHTML display, but only for Word features with no precise or consistently-renderable CSS counterparts. On the whole, though, the display was remarkably close to the original. For instance, here's a portion of a screen capture from a Word document, as displayed in Word:

    Figure 4: Original document opened in Word

    And here's the corresponding output of the upCast-generated XHTML document, viewed in Mozilla:

    Figure 5: upCast-output version of above document, viewed in Mozilla

    Not perfect, but very good. A particularly neat touch is the translation of the Word document's bookmarks into true hypertext equivalents, using fragment identifiers which scroll the browser directly to the correct portion of the document.

    I haven't covered in this column the use of upCast's other output filter options. Like the upCast XML, XHTML, and DocBook outputs, these other options seem to work smoothly and with few surprises. (My favorite of these is the "XSLT Processor" feature, which first generates an XML document and then transforms it to some other form, by way of a user-supplied style sheet and the Apache Xalan XSLT processor.) Nor have I covered the use of infinity-loop's parallel XML-to-Word product, unsurprisingly called downCast. If you're interested in straightforward translation back and forth between Word and various XML formats, though, I encourage you to investigate these other tools on your own. And of course, by all means take a look at the other software on xmlsoftware.com's "Conversion Tools" page.


    Posted: 2005/2/23 23:41:00
     