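Extracting XML Document Content with Java + dom4j.jar

Author: 静远小和尚  Published: 2023-11-29 03:55:10

Tags: java, dom4j.jar, xml

This article shares working code for extracting the content of XML documents with Java and dom4j.jar, for your reference. The details follow.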

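This example extracts content from an XML document mainly through a few traversal operations. It is not the most efficient approach; the main purpose is to practice using those traversal operations.

Contents of the XML document: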


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf version="-//IPTC//DTD NITF 3.3//EN" change.time="19:30" change.date="June 10, 2005">

<head>

<title>An End to Nuclear Testing</title>

<meta name="publication_day_of_month" content="7"/>
<meta name="publication_month" content="7"/>
<meta name="publication_year" content="1993"/>
<meta name="publication_day_of_week" content="Wednesday"/>
<meta name="dsk" content="Editorial Desk"/>
<meta name="print_page_number" content="14"/>
<meta name="print_section" content="A"/>
<meta name="print_column" content="1"/>
<meta name="online_sections" content="Opinion"/>

<docdata>

<doc-id id-string="619929"/>

<doc.copyright year="1993" holder="The New York Times"/>

<identified-content>

<classifier type="descriptor" class="indexing_service">ATOMIC WEAPONS</classifier>
<classifier type="descriptor" class="indexing_service">NUCLEAR TESTS</classifier>
<classifier type="descriptor" class="indexing_service">TESTS AND TESTING</classifier>
<classifier type="descriptor" class="indexing_service">EDITORIALS</classifier>
<person class="indexing_service">CLINTON, BILL (PRES)</person>
<classifier type="types_of_material" class="online_producer">Editorial</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion/Editorials</classifier>
<classifier type="general_descriptor" class="online_producer">Nuclear Tests</classifier>
<classifier type="general_descriptor" class="online_producer">Atomic Weapons</classifier>
<classifier type="general_descriptor" class="online_producer">Tests and Testing</classifier>
<classifier type="general_descriptor" class="online_producer">Armament, Defense and Military Forces</classifier>

</identified-content>
</docdata>
<pubdata name="The New York Times" unit-of-measure="word" item-length="390" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9F0CEFDF1439F934A35754C0A965958260" date.publication="19930707T000000"/>

</head>

<body>

<body.head>

<hedline>

<hl1>An End to Nuclear Testing</hl1>

</hedline>
</body.head>

<body.content>

<block class="lead_paragraph">

<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>

</block>

<block class="full_text">

<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>
<p>Not that nuclear wannabes will necessarily follow America's lead. Nor will an end to all testing assure an end to bomb-making; states like Pakistan have developed nuclear devices without testing them first.</p>
<p>But calling a halt to U.S. nuclear testing makes it easier for leaders in Russia and France to extend the moratoriums they are now observing and improve the atmosphere for prompt negotiation of a treaty to ban all tests.</p>
<p>That test ban in turn should shore up international support for the 1968 Nonproliferation Treaty, linchpin of efforts to stop the spread of nuclear arms, when it comes up for review in 1995. It will also bolster the backing for tighter controls on exports used in bomb-making.</p>
<p>Mr. Clinton has taken three helpful steps. He has extended the Congressionally mandated moratorium on U.S. tests that was due to expire last week. He has declared that the U.S. will not test unless another nation does so first. And he wants to negotiate a total ban on testing.</p>
<p>But the President also wants the nuclear labs to be prepared for a prompt resumption of warhead safety and reliability tests. This could cost millions of dollars and doesn't make much sense, since in Mr. Clinton's own words, "After a thorough review, my Administration has determined that the nuclear weapons in the United States' arsenal are safe and reliable."</p>
<p>Moreover, preparations for testing can take on a life of their own: 30 years after the Limited Test Ban Treaty put an end to above-ground tests, the U.S. still spends $20 million a year on Safeguard C, a program to keep test sites ready.</p>
<p>American security no longer rests on that sort of eternal nuclear vigilance. Mr. Clinton's moratorium may make America safer than all the tests and preparations for tests that the nuclear labs can dream up.</p>

</block>

</body.content>

</body>

</nitf>

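Extraction code:

To process multiple files, first walk the directory and collect every file path into a list, then process the paths in that list one by one.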


package com.njupt.ymh;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// In comes from the algs4 library (algs4.jar), which must be on the classpath
import edu.princeton.cs.algs4.In;

/**
 * Returns a list of file paths.
 * @author 11860
 */
public class SearchFile {

    /**
     * Recursively collects the absolute paths of all files under directoryPath.
     * @param directoryPath directory to search
     * @param isAddDirectory whether subdirectory paths are also added to the list
     * @return the collected paths (empty if directoryPath is a file or does not exist)
     */
    public static List<String> getAllFile(String directoryPath, boolean isAddDirectory) {
        List<String> list = new ArrayList<String>(); // collected paths
        File baseFile = new File(directoryPath); // starting directory

        if (baseFile.isFile() || !baseFile.exists())
            return list;

        File[] files = baseFile.listFiles(); // direct children
        if (files == null) // listFiles() returns null if the directory cannot be read
            return list;

        for (File file : files) {
            if (file.isDirectory()) {
                if (isAddDirectory)
                    list.add(file.getAbsolutePath()); // full path of the subdirectory

                list.addAll(getAllFile(file.getAbsolutePath(), isAddDirectory));
            } else {
                list.add(file.getAbsolutePath());
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai", false);
        System.out.println(listFile.size());
        if (listFile.size() < 5) // the demo below reads entries 3 and 4
            return;

        File file = new File(listFile.get(3));
        In in = new In(listFile.get(4));
        while (in.hasNextLine()) {
            String readLine = in.readLine().trim(); // read the current line
            System.out.println(readLine);
        }
        System.out.println(file.length());
    }

}
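
For comparison, the same "collect every path, then process them one by one" step can be sketched with the JDK's java.nio.file API instead of manual recursion. This is only a minimal sketch and not the author's code: the class name WalkFiles and the helper listFiles are hypothetical, and unlike getAllFile it never adds directory paths to the result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class WalkFiles {

    // Hypothetical helper: absolute paths of all regular files under root, sorted.
    static List<String> listFiles(String root) throws IOException {
        Path base = Paths.get(root);
        if (!Files.isDirectory(base)) {
            return new ArrayList<>(); // missing path or plain file: nothing to walk
        }
        // Files.walk visits the whole tree; the stream must be closed after use.
        try (Stream<Path> paths = Files.walk(base)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> p.toAbsolutePath().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The directory is taken from the command line; "." is just a default.
        for (String path : listFiles(args.length > 0 ? args[0] : ".")) {
            System.out.println(path);
        }
    }
}
```

Sorting the result makes the processing order reproducible, which File.listFiles() does not guarantee.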

package com.njupt.ymh;

import java.io.File;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class NewsPaper {
    int doc_id; // article id
    String doc_title; // article title
    String lead_paragraph; // lead paragraph of the article
    String full_text; // full text of the article
    String date; // article date

    public NewsPaper(String xml) {
        doc_id = -1; // -1 means "no id found"
        doc_title = null;
        lead_paragraph = null;
        full_text = null;
        date = null;
        searchValue(xml);
    }

    /**
     * Loads the XML file into a dom4j Document.
     * @param fileName path of the XML file
     * @return the parsed Document, or null if parsing failed
     */
    private Document load(String fileName) {
        Document document = null; // parsed document
        SAXReader saxReader = new SAXReader(); // SAX-based reader

        try {
            document = saxReader.read(new File(fileName));
        } catch (DocumentException e) {
            e.printStackTrace();
        }

        return document;
    }

    /**
     * Returns the root element of the Document.
     * @param document the parsed Document
     * @return the root element
     */
    private Element getRootNode(Document document) {
        return document.getRootElement();
    }

    /**
     * Extracts the required node values.
     * @param xml path of the XML file
     */
    private void searchValue(String xml) {
        Document document = load(xml);
        if (document == null) // parsing failed; the fields keep their defaults
            return;
        Element root = getRootNode(document); // root node

        // Article date, taken from characters 10-19 of the file path.
        // This only works for paths laid out like those under E:\huadai\ in main.
        if (xml.length() >= 20)
            date = xml.substring(10, 20);

        // Article title (XPath calls in dom4j need the jaxen jar on the classpath)
        doc_title = root.valueOf("//head/title");

        // Article id
        List<Node> list_doc_id = document.selectNodes("//doc-id/@id-string");
        for (Node ele : list_doc_id) {
            doc_id = Integer.parseInt(ele.getText());
        }

        // Article content
        for (Iterator<Element> i = root.elementIterator(); i.hasNext();) {
            Element el = i.next(); // head, body

            // Only the body node is processed; compare strings with equals(), not ==
            if ("body".equals(el.getName())) {
                for (Iterator<Element> body = el.elementIterator(); body.hasNext();) {
                    Element elbody = body.next();

                    if ("body.content".equals(elbody.getName())) {
                        for (Iterator<Element> block = elbody.elementIterator(); block.hasNext();) {
                            Element block_class = block.next();

                            // Constant-first equals() avoids a NullPointerException
                            // when a block has no class attribute
                            if ("full_text".equals(block_class.attributeValue("class"))) { // full_text
                                List<Node> list_text = block_class.selectNodes("p");
                                for (Node text : list_text)
                                    if (full_text == null)
                                        full_text = text.getStringValue();
                                    else
                                        full_text = full_text + " " + text.getStringValue();
                            } else { // every other block is treated as lead_paragraph
                                List<Node> list_lead = block_class.selectNodes("p");
                                for (Node lead : list_lead)
                                    if (lead_paragraph == null)
                                        lead_paragraph = lead.getStringValue();
                                    else
                                        lead_paragraph = lead_paragraph + " " + lead.getStringValue();
                            }
                        }
                    }
                }
            }
        }
    }

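    /**
     * Returns the article title.
     * @return the title
     */
    public String getTitle() {
        return doc_title;
    }

    /**
     * Returns the article id.
     * @return the id, or -1 if none was found
     */
    public int getID() {
        return doc_id;
    }

    /**
     * Returns the article lead paragraph.
     * @return the lead paragraph, or null if none was found
     */
    public String getLead() {
        if (getID() < 394070 && lead_paragraph != null && lead_paragraph.length() > 6) // before 1990-10-22: skip the first 6 characters
            return lead_paragraph.substring(6);
        else // 1990-10-22 and later
            return lead_paragraph;
    }

    /**
     * Returns the article body text.
     * @return the full text, or null if none was found
     */
    public String getfull() {
        if (getID() < 394070 && full_text != null && full_text.length() > 6) // before 1990-10-22: skip the first 6 characters
            return full_text.substring(6);
        else
            return full_text;
    }

    /**
     * Returns the article date.
     * @return the date, or null if it could not be read from the path
     */
    public String getDate() {
        return date;
    }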

    /**
     * Checks whether the extracted information is usable.
     * @return true if every field is present and within the length limits
     */
    public boolean isUseful() {
        if (getID() == -1)
            return false;
        if (getDate() == null)
            return false;
        if (getTitle() == null || getTitle().length() >= 255)
            return false;
        if (getLead() == null || getLead().length() >= 65535)
            return false;
        if (getfull() == null || getfull().length() >= 65535)
            return false;

        return !isnum();
    }

    /**
     * Detects numeric-content articles, which start with a special marker.
     * @return true if the text starts with the company-report marker
     */
    private boolean isnum() {
        if (getfull() != null && getfull().length() > 24) {
            if (getfull().substring(0, 20).contains("*3*** COMPANY REPORT")) { // filter out numeric articles
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> listFile = SearchFile.getAllFile("E:\\huadai\\1989\\10", false); // file list
        int count = 0; // articles processed
        int i = 0; // articles rejected by isUseful()
        for (String string : listFile) {
            NewsPaper newsPaper = new NewsPaper(string);
            count++;
            if (!newsPaper.isUseful()) {
                i++;
                System.out.println(newsPaper.getLead());
            }
        }

        System.out.println(i + " " + count);
    }
}
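
The article notes that nested traversal is not the most efficient approach. As a hedged illustration, the nested iterators in searchValue can be collapsed into one XPath expression per field. The sketch below uses the JDK's built-in javax.xml.xpath rather than dom4j, so it runs without extra jars; the class XPathExtract and its helpers parse and joinText are hypothetical names, and the inline XML is a trimmed stand-in for the sample document with placeholder paragraph text. The same expression strings could presumably be passed to dom4j's selectNodes, which the article already uses for the doc-id.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathExtract {

    // Hypothetical helper: parse an XML string into a W3C DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: text of every node matched by expr, joined with single spaces.
    static String joinText(Document doc, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            parts.add(nodes.item(i).getTextContent());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) throws Exception {
        // Trimmed stand-in for the sample document; paragraph text is a placeholder.
        String xml = "<nitf><head><title>An End to Nuclear Testing</title>"
                + "<docdata><doc-id id-string=\"619929\"/></docdata></head>"
                + "<body><body.content>"
                + "<block class=\"lead_paragraph\"><p>First.</p></block>"
                + "<block class=\"full_text\"><p>First.</p><p>Second.</p></block>"
                + "</body.content></body></nitf>";
        Document doc = parse(xml);
        System.out.println(joinText(doc, "/nitf/head/title"));
        System.out.println(joinText(doc, "//doc-id/@id-string"));
        System.out.println(joinText(doc, "//block[@class='lead_paragraph']/p"));
        System.out.println(joinText(doc, "//block[@class='full_text']/p"));
    }
}
```

Selecting on the class attribute explicitly also avoids treating every block other than full_text as the lead paragraph, which is what the else branch in searchValue does.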

Source: https://blog.csdn.net/qq_29672495/article/details/82860226
