Cannot parse poorly formed XML
I've been trying to parse this feed. If you click on that link, you'll notice that even the browser can't parse it correctly.
Anyway, my hosting service won't let me use simplexml_load_file, so I've been using cURL to fetch the feed and then loading the string into the DOM, like this:
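$dom = new DOMDocument;
// loadXML() returns false on failure; $dom itself is always an object,
// so the check has to be on the return value rather than on $dom.
if (!$dom->loadXML($rawXML)) {
    echo 'Error while parsing the document';
    exit;
}
$xml = simplexml_import_dom($dom);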
But I get errors ("DOMDocument::loadXML() [domdocument.loadxml]: Entity 'nbsp' not defined in Entity"). I then tried SimpleXMLElement without luck; it shows the same error ("parser error : Entity 'nbsp' not defined", etc.) because of the HTML in that one element:
$xml = new SimpleXMLElement($rawXML);
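For anyone wondering why the parser rejects this: XML itself predefines only five named entities (&amp;lt;, &amp;gt;, &amp;amp;, &amp;apos;, &amp;quot;), so an HTML entity such as &amp;nbsp; is undefined unless a DTD declares it. The snippet below is a minimal sketch of that behaviour, not the actual feed; the sample strings and variable names are made up for illustration. It uses libxml_use_internal_errors() so the errors can be inspected instead of being printed as warnings.

```php
<?php
// Hypothetical sample strings; the real feed is not reproduced here.
$bad  = '<item><description>a&nbsp;b</description></item>';
$good = '<item><description>a&#160;b</description></item>';

// Collect libxml errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$badOk = $dom->loadXML($bad);    // false: 'nbsp' is not declared anywhere
$errors = libxml_get_errors();   // array of LibXMLError objects
libxml_clear_errors();

$dom = new DOMDocument;
$goodOk = $dom->loadXML($good);  // true: numeric character references need no declaration

foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```

The numeric reference &amp;#160; is the same character as &amp;nbsp;, which is why the second string loads while the first does not.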
So my question is: how do I skip, ignore, or remove that element so I can parse the rest of the data?
Edit: Thanks to mjv for the solution! I just did this (for others who have the same trouble):
$rawXML = str_replace('<description>', '<description><![CDATA[', $rawXML);
$rawXML = str_replace('</description>', ']]></description>', $rawXML);
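The two str_replace calls work as long as every description tag is written exactly as `<description>` and its content never contains `]]>`. A slightly more defensive variant might look like the sketch below. The function name wrapInCdata and the sample string are hypothetical, and the regex assumes the element is never nested inside itself. It tolerates attributes on the opening tag, leaves self-closing tags and already-wrapped content alone, and splits any `]]>` found inside the body so the CDATA section stays valid.

```php
<?php
// Hypothetical helper: wrap the body of every <$tag>...</$tag> in CDATA.
function wrapInCdata(string $xml, string $tag = 'description'): string
{
    $name = preg_quote($tag, '~');
    // (?<!/) skips self-closing tags; the s flag lets the body span lines.
    $pattern = '~<' . $name . '(\b[^>]*)(?<!/)>(.*?)</' . $name . '>~s';

    $result = preg_replace_callback($pattern, function (array $m) use ($tag): string {
        $body = $m[2];
        // Leave content that is already wrapped untouched.
        if (strpos(ltrim($body), '<![CDATA[') === 0) {
            return $m[0];
        }
        // "]]>" cannot appear inside CDATA, so split it across two sections.
        $body = str_replace(']]>', ']]]]><![CDATA[>', $body);
        return "<{$tag}{$m[1]}><![CDATA[{$body}]]></{$tag}>";
    }, $xml);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $xml;
}

// Made-up sample input standing in for the cURL response.
$rawXML = '<rss><channel><item><title>T</title>'
        . '<description>a&nbsp;<b>b</b></description></item></channel></rss>';

// LIBXML_NOCDATA merges CDATA sections into plain text nodes.
$xml = new SimpleXMLElement(wrapInCdata($rawXML), LIBXML_NOCDATA);
$description = (string) $xml->channel->item->description;
echo $description, "\n";
```

Note that the description comes back as the literal markup, entity text included, so it still needs HTML decoding (for example with html_entity_decode) before display as plain text.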
Source: https://stackoverflow.com/questions/1518142
Accepted answer
Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, and PDF is not one of them.
You'll either want to use Apache PDFBox directly, or use Apache Tika, which handles both Microsoft Office and PDF file formats (amongst many others).