How to remove an http link at the end of body text in Python

Posted by a user, posted at: 2022-11-08 07:14

3 answers

Helpful user, answered at: 2023-11-08 04:54

import re

from bs4 import BeautifulSoup

# Matches a src value containing "img/", optionally preceded by dots (e.g. "../img/").
IMG_PATH = re.compile(r'\.*img/')
# Prefix prepended to every matching image path.
PREFIX = "/go/${basefact9uu99.currentMediaVersion}/css/QuansuCss/AE/2022/dxbpek2022/EN"

def check_flag(flag):
    """Return True if the path contains an "img/" segment."""
    return bool(IMG_PATH.search(flag))

# Parse the source page; the with-block closes the file automatically.
with open('index.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

for element in soup.find_all('img'):
    if 'src' in element.attrs:
        print(element.attrs['src'])
        if check_flag(element.attrs['src']):
            element.attrs['src'] = PREFIX + element.attrs['src']

print("##################################")
with open('indexmichenT8.html', 'w', encoding='utf-8') as fp:
    fp.write(soup.prettify())  # prettify() formats the soup so the output is readable

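Helpful user, answered at: 2023-11-08 04:54

Try working through the following issues:
1) how to replace whitespace
2) how to handle layout and indentation
3) tags that need special handling, such as <h1> and <p>
4) table layout
5) CSS handling
Of course, you can also simply use the regular expression below (this leaves some of the issues above unhandled):
html = re.sub(r"(?isu)<[^>]+>", " ", html)
This removes the tags by replacing each one with a space, but the result will certainly not be ideal.
Note: the only import this needs is the re module (import re).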
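
As a rough illustration of the regex approach in this answer, here is a minimal sketch. The helper name strip_tags and the sample string are made up for the example, and the whitespace-collapsing step is an addition that only touches item 1 of the list; the other items (tables, CSS, tag-specific handling) remain unhandled.

```python
import re

# Matches any HTML tag. Inline flags: i = ignore case, s = dot matches newline, u = Unicode.
TAG_RE = re.compile(r"(?isu)<[^>]+>")

def strip_tags(html):
    """Hypothetical helper: replace every tag with a space, then collapse whitespace."""
    text = TAG_RE.sub(" ", html)
    # Collapse the runs of spaces and newlines left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()

sample = "<h1>Title</h1>\n<p>Some <b>bold</b> text</p>"
print(strip_tags(sample))
```

A regex like this does not understand HTML structure, so content inside script or style elements would be kept as text. Since the first answer already uses BeautifulSoup, its get_text() method may be a more robust choice where a parser is available.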

Helpful user, answered at: 2023-11-08 04:55

Use Python's built-in re module and match the Chinese characters with a regular expression.
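
The answer gives only the idea, so the following is a sketch under stated assumptions rather than the answerer's actual code. It assumes "Chinese characters" means the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and it adds a second, alternative helper that removes a trailing http/https link directly, which is what the question asks for. The function names and the example.com URL are hypothetical.

```python
import re

# Assumption: "Chinese characters" = the basic CJK Unified Ideographs block.
HANZI_RE = re.compile(r"[\u4e00-\u9fff]+")
# An http/https link, plus any surrounding whitespace, at the very end of the text.
TRAILING_URL_RE = re.compile(r"\s*https?://\S+\s*$")

def keep_hanzi(text):
    """Keep only the Chinese characters; punctuation, digits and Latin text are dropped."""
    return "".join(HANZI_RE.findall(text))

def strip_trailing_url(text):
    """Remove a link at the end of the text and leave everything else untouched."""
    return TRAILING_URL_RE.sub("", text)

sample = "今天天气很好。 http://example.com/abc"
print(keep_hanzi(sample))
print(strip_trailing_url(sample))
```

Matching only Chinese characters also discards punctuation and any Latin words in the body, so the second helper is likely the better fit when the body text is mixed. Note that it treats everything from "http" up to the end of the text as the link if no whitespace follows the URL.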