<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
# <p class="story">Once upon a time there were three little sisters; and their names were</p>
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
The output is shown below. Every node gets printed: first the outermost html tag, then the head tag, and so on, peeling off one level of nesting at a time.
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
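To make the traversal order concrete, here is a minimal sketch that compares direct children with all descendants. The markup is a shortened, hypothetical stand-in for the sample document, and it uses the built-in "html.parser" instead of 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup; the sample document in this article is longer.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>Hi</b></p></body></html>"
)

# Assumption: "html.parser" (bundled with Python) stands in for 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of a tag.
direct = [node.name for node in soup.html.children if isinstance(node, Tag)]

# .descendants walks the whole subtree in document order:
# each tag comes first, followed by everything nested inside it.
nested = [node.name for node in soup.html.descendants if isinstance(node, Tag)]

print(direct)
print(nested)
```

Text nodes also appear in both iterators; the isinstance check filters them out so that only tag names are listed.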
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
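print(soup.p.next_sibling)
# This node is actually just whitespace.
print(soup.p.previous_sibling)
# The attribute is previous_sibling, not prev_sibling (an unknown attribute name is treated as a child-tag lookup and yields None).
# previous_sibling returns None only when no sibling precedes the tag; whitespace before the tag counts as a text node.
print(soup.p.next_sibling.next_sibling)
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# The sibling after the whitespace node is the next visible tag.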
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'
# last_a_tag is the third link in the sample document.
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# <p class="story">...</p>
# '...'
# '\n'
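That covers the basics of traversing the document tree.
7. Searching the document tree
(1) find_all(name, attrs, recursive, text, **kwargs)
The find_all() method searches all tag descendants of the current tag and checks whether each one matches the filter.
1) The name parameter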
The name parameter finds every tag whose name is name. String objects (text nodes) are skipped automatically.
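A. Passing a string
The simplest filter is a string. When you pass a string to a search method, Beautiful Soup looks for tags whose name matches that string exactly. For example, passing 'b' finds all <b> tags in the document.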
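A short sketch of the string filter follows. The markup is a hypothetical, shortened version of the sample document, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup based on the sample document.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)

# Assumption: "html.parser" (bundled with Python) stands in for 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# A string filter matches the tag name exactly,
# so "b" selects <b> but not <body>.
bold_tags = soup.find_all("b")
names = [tag.name for tag in bold_tags]

print(bold_tags)
```

find_all() always returns a list-like result, even when only one tag matches.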
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
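The call above passes a function as the filter, but the definition of has_class_but_no_id does not appear in this section. The sketch below assumes it matches the definition used in the Beautiful Soup documentation: it accepts a tag and returns True when the tag has a class attribute and no id attribute. The markup is a hypothetical, shortened sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup based on the sample document.
html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
</p>
<p class="story">...</p>"""

# Assumption: "html.parser" (bundled with Python) stands in for 'lxml'.
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Assumed definition: keep tags that have a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all() calls the function once per tag and keeps those returning True.
matches = soup.find_all(has_class_but_no_id)

for tag in matches:
    print(tag.name, tag["class"])
```

The <b> tag is skipped because it has no class, and the <a> tag is skipped because it has an id.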
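2) Keyword arguments
Note: if a named argument is not one of the search method's built-in parameter names, the search treats it as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup checks the "id" attribute of every tag.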
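The following sketch shows keyword arguments used as attribute filters. The markup is a hypothetical, shortened sample. The class_ spelling is not mentioned above; it is the form Beautiful Soup accepts because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup based on the sample document.
html = """<p class="story">
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""

# Assumption: "html.parser" (bundled with Python) stands in for 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# id is not a built-in parameter of find_all(), so it filters on the id attribute.
by_id = soup.find_all(id="link2")

# class is a Python keyword, so the attribute filter is spelled class_.
by_class = soup.find_all(class_="sister")

print([a.get_text() for a in by_id])
print([a["id"] for a in by_class])
```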
soup = BeautifulSoup(html, 'lxml')
# select() returns a list-like collection of matching tags.
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())