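Extracting XML Document Content with Java and dom4j.jar
Author: 静远小和尚 Published: 2023-11-29 03:55:10
Tags: java, dom4j.jar, xml
This article shares working code that extracts the content of XML documents with Java and dom4j.jar, for your reference. The details are as follows.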
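This example extracts content from XML files mainly by means of a few traversal operations. It is not the most efficient approach; the main purpose is to practice using those traversal operations.
Contents of the XML document: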
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf version="-//IPTC//DTD NITF 3.3//EN" change.time="19:30" change.date="June 10, 2005">
<head>
<title>An End to Nuclear Testing</title>
<meta name="publication_day_of_month" content="7"/>
<meta name="publication_month" content="7"/>
<meta name="publication_year" content="1993"/>
<meta name="publication_day_of_week" content="Wednesday"/>
<meta name="dsk" content="Editorial Desk"/>
<meta name="print_page_number" content="14"/>
<meta name="print_section" content="A"/>
<meta name="print_column" content="1"/>
<meta name="online_sections" content="Opinion"/>
<docdata>
<doc-id id-string="619929"/>
<doc.copyright year="1993" holder="The New York Times"/>
<identified-content>
<classifier type="descriptor" class="indexing_service">ATOMIC WEAPONS</classifier>
<classifier type="descriptor" class="indexing_service">NUCLEAR TESTS</classifier>
<classifier type="descriptor" class="indexing_service">TESTS AND TESTING</classifier>
<classifier type="descriptor" class="indexing_service">EDITORIALS</classifier>
<person class="indexing_service">CLINTON, BILL (PRES)</person>
<classifier type="types_of_material" class="online_producer">Editorial</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion/Editorials</classifier>
<classifier type="general_descriptor" class="online_producer">Nuclear Tests</classifier>
<classifier type="general_descriptor" class="online_producer">Atomic Weapons</classifier>
<classifier type="general_descriptor" class="online_producer">Tests and Testing</classifier>
<classifier type="general_descriptor" class="online_producer">Armament, Defense and Military Forces</classifier>
</identified-content>
</docdata>
<pubdata name="The New York Times" unit-of-measure="word" item-length="390" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9F0CEFDF1439F934A35754C0A965958260" date.publication="19930707T000000"/>
</head>
<body>
<body.head>
<hedline>
<hl1>An End to Nuclear Testing</hl1>
</hedline>
</body.head>
<body.content>
<block class="lead_paragraph">
<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>
</block>
<block class="full_text">
<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>
<p>Not that nuclear wannabes will necessarily follow America's lead. Nor will an end to all testing assure an end to bomb-making; states like Pakistan have developed nuclear devices without testing them first.</p>
<p>But calling a halt to U.S. nuclear testing makes it easier for leaders in Russia and France to extend the moratoriums they are now observing and improve the atmosphere for prompt negotiation of a treaty to ban all tests.</p>
<p>That test ban in turn should shore up international support for the 1968 Nonproliferation Treaty, linchpin of efforts to stop the spread of nuclear arms, when it comes up for review in 1995. It will also bolster the backing for tighter controls on exports used in bomb-making.</p>
<p>Mr. Clinton has taken three helpful steps. He has extended the Congressionally mandated moratorium on U.S. tests that was due to expire last week. He has declared that the U.S. will not test unless another nation does so first. And he wants to negotiate a total ban on testing.</p>
<p>But the President also wants the nuclear labs to be prepared for a prompt resumption of warhead safety and reliability tests. This could cost millions of dollars and doesn't make much sense, since in Mr. Clinton's own words, "After a thorough review, my Administration has determined that the nuclear weapons in the United States' arsenal are safe and reliable."</p>
<p>Moreover, preparations for testing can take on a life of their own: 30 years after the Limited Test Ban Treaty put an end to above-ground tests, the U.S. still spends $20 million a year on Safeguard C, a program to keep test sites ready.</p>
<p>American security no longer rests on that sort of eternal nuclear vigilance. Mr. Clinton's moratorium may make America safer than all the tests and preparations for tests that the nuclear labs can dream up.</p>
</block>
</body.content>
</body>
</nitf>
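The `pubdata` element in the sample above carries the publication date as `date.publication="19930707T000000"`. The following is a minimal sketch of turning that value into a `LocalDate` with `java.time`. The class name `PubDateParser` is hypothetical, and the pattern `yyyyMMdd'T'HHmmss` is an assumption inferred from this single sample value.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of the original article's code.
final class PubDateParser {
    // Assumed pattern, inferred from the sample value "19930707T000000".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss");

    // Parses a date.publication value and keeps only the calendar date.
    static LocalDate parse(String value) {
        return LocalDateTime.parse(value, FORMAT).toLocalDate();
    }

    public static void main(String[] args) {
        LocalDate date = parse("19930707T000000");
        System.out.println(date);                // 1993-07-07
        System.out.println(date.getDayOfWeek()); // WEDNESDAY
    }
}
```

The parsed day of week agrees with the sample's `publication_day_of_week` meta value, which is a quick sanity check on the assumed pattern.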
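Extraction code:
To process multiple files, first walk the directory and collect every file path into a list, then process the paths in that list one by one.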
package com.njupt.ymh;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.princeton.cs.algs4.In; // from the Princeton algs4 library (algs4.jar); used only in main

/**
 * Returns the list of file paths under a directory.
 * @author 11860
 */
public class SearchFile {
    /**
     * Recursively collects the absolute paths of all files under directoryPath.
     * @param directoryPath directory to start from
     * @param isAddDirectory whether the paths of subdirectories are also added to the list
     * @return list of absolute paths; empty if the path is a file or does not exist
     */
    public static List<String> getAllFile(String directoryPath, boolean isAddDirectory) {
        List<String> list = new ArrayList<String>(); // collected file paths
        File baseFile = new File(directoryPath);     // starting directory
        if (baseFile.isFile() || !baseFile.exists())
            return list;
        File[] files = baseFile.listFiles();         // direct children
        if (files == null)                           // listFiles() returns null if the directory cannot be read
            return list;
        for (File file : files) {
            if (file.isDirectory()) {
                if (isAddDirectory)
                    list.add(file.getAbsolutePath()); // absolute path of the subdirectory
                list.addAll(getAllFile(file.getAbsolutePath(), isAddDirectory));
            } else {
                list.add(file.getAbsolutePath());
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai", false);
        System.out.println(listFile.size());
        if (listFile.size() < 5) // the demo below reads entries 3 and 4
            return;
        File file = new File(listFile.get(3));
        In in = new In(listFile.get(4));
        while (in.hasNextLine()) {
            String readLine = in.readLine().trim(); // read the current line
            System.out.println(readLine);
        }
        System.out.println(file.length());
    }
}
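The recursive listing above can also be written with the JDK's `java.nio.file.Files.walk`, which handles the recursion itself. This is a minimal sketch of that alternative, not the author's method: the class name `WalkFiles` and the method name `listFiles` are hypothetical, and unlike `getAllFile` it never includes directory paths in the result.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to SearchFile.getAllFile(path, false).
final class WalkFiles {
    // Returns the absolute paths of all regular files below directoryPath, sorted.
    static List<String> listFiles(String directoryPath) {
        Path base = Paths.get(directoryPath);
        if (!Files.isDirectory(base)) {
            return Collections.emptyList(); // missing path or plain file: nothing to list
        }
        // Files.walk returns a lazy stream that must be closed, hence try-with-resources.
        try (Stream<Path> paths = Files.walk(base)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> p.toAbsolutePath().toString())
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : ".";
        listFiles(dir).forEach(System.out::println);
    }
}
```

Sorting the paths makes the order of the list reproducible, which matters if later code picks entries by index as the `main` method above does.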
package com.njupt.ymh;

import java.io.File;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class NewsPaper {
    int doc_id;            // article id
    String doc_title;      // article title
    String lead_paragraph; // lead paragraph
    String full_text;      // full text
    String date;           // article date

    public NewsPaper(String xml) {
        doc_id = -1;
        doc_title = null;
        lead_paragraph = null;
        full_text = null;
        date = null;
        searchValue(xml);
    }
    /**
     * Loads the XML file into a Document.
     * @param fileName path of the XML file
     * @return the parsed Document, or null if parsing failed
     */
    private Document load(String fileName) {
        Document document = null;
        SAXReader saxReader = new SAXReader(); // reads and parses the file
        try {
            document = saxReader.read(new File(fileName));
        } catch (DocumentException e) {
            e.printStackTrace();
        }
        return document;
    }

    /**
     * Returns the root element of the Document.
     * @param document the parsed document
     * @return the root element
     */
    private Element getRootNode(Document document) {
        return document.getRootElement();
    }
    /**
     * Extracts the required node values.
     * @param xml path of the XML file
     */
    private void searchValue(String xml) {
        Document document = load(xml);
        if (document == null) // load() failed; keep the defaults so that isUseful() returns false
            return;
        Element root = getRootNode(document); // root node

        // Article date, cut out of the file path. The fixed offsets appear to assume
        // paths of the form E:\huadai\yyyy\MM\dd\..., as used in main.
        if (xml.length() >= 20)
            date = xml.substring(10, 20);

        // Article title
        doc_title = root.valueOf("//head/title");

        // Article id
        List<Node> list_doc_id = document.selectNodes("//doc-id/@id-string");
        for (Node ele : list_doc_id) {
            doc_id = Integer.parseInt(ele.getText());
        }

        // Article content: walk root -> body -> body.content -> block
        for (Iterator<Element> i = root.elementIterator(); i.hasNext();) {
            Element el = i.next(); // head, body
            if (!"body".equals(el.getName())) // compare strings with equals(), not ==
                continue;
            for (Iterator<Element> body = el.elementIterator(); body.hasNext();) {
                Element elbody = body.next();
                if (!"body.content".equals(elbody.getName()))
                    continue;
                for (Iterator<Element> block = elbody.elementIterator(); block.hasNext();) {
                    Element block_class = block.next();
                    List<Node> paragraphs = block_class.selectNodes("p");
                    // Constant-first equals() avoids a NullPointerException when class is missing.
                    if ("full_text".equals(block_class.attributeValue("class"))) { // full_text
                        for (Node text : paragraphs)
                            full_text = (full_text == null)
                                    ? text.getStringValue()
                                    : full_text + " " + text.getStringValue();
                    } else { // any other block is treated as lead_paragraph
                        for (Node lead : paragraphs)
                            lead_paragraph = (lead_paragraph == null)
                                    ? lead.getStringValue()
                                    : lead_paragraph + " " + lead.getStringValue();
                    }
                }
            }
        }
    }
    /**
     * Returns the article title.
     */
    public String getTitle() {
        return doc_title;
    }

    /**
     * Returns the article id.
     */
    public int getID() {
        return doc_id;
    }

    /**
     * Returns the article lead. For articles with an id below 394070
     * (published before 1990-10-22) the first 6 characters are dropped.
     */
    public String getLead() {
        if (getID() < 394070 && lead_paragraph != null && lead_paragraph.length() > 6) // before 1990-10-22
            return lead_paragraph.substring(6);
        else // 1990-10-22 and later
            return lead_paragraph;
    }

    /**
     * Returns the article body. For articles with an id below 394070
     * (published before 1990-10-22) the first 6 characters are dropped.
     */
    public String getfull() {
        if (getID() < 394070 && full_text != null && full_text.length() > 6) // before 1990-10-22
            return full_text.substring(6);
        else
            return full_text;
    }

    /**
     * Returns the article date.
     */
    public String getDate() {
        return date;
    }
    /**
     * Checks whether the extracted information is usable.
     * @return true if every field is present and within its length limit
     */
    public boolean isUseful() {
        if (getID() == -1)
            return false;
        if (getDate() == null)
            return false;
        if (getTitle() == null || getTitle().length() >= 255)
            return false;
        if (getLead() == null || getLead().length() >= 65535)
            return false;
        if (getfull() == null || getfull().length() >= 65535)
            return false;
        return !isnum();
    }

    /**
     * Detects articles made up of numeric content, identified by a special opening marker.
     * @return true if the body starts with the marker
     */
    private boolean isnum() {
        if (getfull() != null && getfull().length() > 24) {
            if (getfull().substring(0, 20).contains("*3*** COMPANY REPORT")) { // filter out numeric articles
                return true;
            }
        }
        return false;
    }
    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai\\1989\\10", false); // file list
        int count = 0; // number of files processed
        int i = 0;     // number of files rejected by isUseful()
        for (String string : listFile) {
            NewsPaper newsPaper = new NewsPaper(string);
            count++;
            if (!newsPaper.isUseful()) {
                i++;
                System.out.println(newsPaper.getLead());
            }
        }
        System.out.println(i + " " + count);
    }
}
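The nested iterator loops in `searchValue` can be replaced by XPath queries, which is one way to address the remark at the top that the traversal is not the optimal method. A minimal sketch follows, with the JDK's built-in `javax.xml.xpath` swapped in for dom4j so that it runs without extra jars. The class name `XPathExtract`, the map keys, and the trimmed `SAMPLE` document are assumptions for illustration. The sample leaves out the DOCTYPE line, which a parser may otherwise try to resolve over the network.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: XPath-based extraction with the JDK's own XML APIs.
final class XPathExtract {
    // Trimmed stand-in for the NITF sample shown earlier (assumption).
    static final String SAMPLE =
            "<nitf><head><title>An End to Nuclear Testing</title>"
            + "<docdata><doc-id id-string=\"619929\"/></docdata></head>"
            + "<body><body.content>"
            + "<block class=\"lead_paragraph\"><p>First.</p></block>"
            + "<block class=\"full_text\"><p>First.</p><p>Second.</p></block>"
            + "</body.content></body></nitf>";

    // Joins the text of all <p> children of the block with the given class, separated by spaces.
    static String joinParagraphs(XPath xpath, Document doc, String blockClass)
            throws XPathExpressionException {
        NodeList ps = (NodeList) xpath.evaluate(
                "//body.content/block[@class='" + blockClass + "']/p",
                doc, XPathConstants.NODESET);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ps.getLength(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(ps.item(i).getTextContent());
        }
        return sb.toString();
    }

    // Parses the XML text and returns the four fields keyed by name.
    static Map<String, String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("doc_id", xpath.evaluate("//doc-id/@id-string", doc));
            fields.put("title", xpath.evaluate("//head/title", doc));
            fields.put("lead_paragraph", joinParagraphs(xpath, doc, "lead_paragraph"));
            fields.put("full_text", joinParagraphs(xpath, doc, "full_text"));
            return fields;
        } catch (Exception e) { // parser setup, parsing, or XPath failure
            throw new IllegalArgumentException("could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Unlike the `else` branch in `searchValue`, this sketch selects the lead block by its `class` value, so blocks with other classes are ignored rather than appended to the lead. The same XPath 1.0 expressions should also be usable with dom4j's `selectNodes`, which the code above already calls for `//doc-id/@id-string`.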
Source: https://blog.csdn.net/qq_29672495/article/details/82860226