Python error with re and BeautifulSoup: TypeError after parsing with html.parser

When crawling with Python, you fetch a page from the web and then extract the specific pieces you want, and an error can occur during that extraction step.

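Using only re for regular expressions, or only BeautifulSoup, does not trigger it. The error appears when the two are combined and an object returned by BeautifulSoup is passed straight into a re function.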

Naver news crawler: raises the error

import requests
import re
from bs4 import BeautifulSoup

url="http://news.naver.com/main/main.nhn?mode=LSD&mid=shm&sid1=105"
html=requests.get(url)
bs_html=BeautifulSoup(html.content,"html.parser")

news_list=bs_html.find_all("a",{"class":"cluster_text_headline"})

for n in news_list:
    news = re.findall('href="(.+?)">(.+?)</a>',n)[0] 
    title = news[1]
    link = news[0].replace("amp;","")  # Naver links contain "amp;", which breaks the URL, so strip it
    print(title)
    print(link)

Error output

Traceback (most recent call last):
  File "C:/projectAll/python1/test.py", line 12, in <module>
    news = re.findall('href="(.+?)">(.+?)</a>',n)[0]
  File "C:\Users\joy\AppData\Local\Programs\Python\Python37-32\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

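This error occurs because what BeautifulSoup4 returns is not a plain str. find_all returns Tag objects, which keep the tag structure so that elements can be navigated. The re functions, however, only accept a string or a bytes-like object as the text to search. Converting the BeautifulSoup result to a string first, as in str(variable_holding_the_result), makes the error go away.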
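To see the type difference without hitting the network, here is a minimal sketch. The sample_html string and the example.com link are made up for illustration and only stand in for the fetched Naver page; the regex is the same one used in the crawler.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page (not real Naver HTML).
sample_html = (
    '<a class="cluster_text_headline" '
    'href="https://example.com/a?x=1&amp;y=2">Sample headline</a>'
)

soup = BeautifulSoup(sample_html, "html.parser")
tag = soup.find("a", {"class": "cluster_text_headline"})

# find/find_all give back bs4 Tag objects, not str.
print(type(tag))

pattern = 'href="(.+?)">(.+?)</a>'

# Passing the Tag directly reproduces the TypeError.
try:
    re.findall(pattern, tag)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Converting to str first lets re search the serialized markup.
matches = re.findall(pattern, str(tag))
print(matches)
```

Note that this regex only works when href is the last attribute before the closing bracket of the tag, which is an assumption about the markup.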

Extracting data usually involves several libraries, and differences between their return types can cause errors like this one, so it helps to keep the types in mind and convert where needed.
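One way to guard against this kind of mismatch is to normalize values before handing them to re. The sketch below is only an illustration: ensure_text and FakeTag are hypothetical names, not part of re or bs4, and FakeTag merely imitates an object whose str() form is markup.

```python
import re


def ensure_text(value, encoding="utf-8"):
    """Hypothetical helper: return a str that re functions will accept."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # Decode raw bytes (for example a response body) into text.
        return bytes(value).decode(encoding, errors="replace")
    # Anything else (for example a parsed tag object) falls back to str().
    return str(value)


class FakeTag:
    """Stand-in for a parsed tag; only its str() form matters here."""

    def __str__(self):
        return '<a href="https://example.com/">Example</a>'


for candidate in ("plain text", b"raw bytes", FakeTag()):
    text = ensure_text(candidate)
    # Show which input type was turned into which output type.
    print(type(candidate).__name__, "->", type(text).__name__)

links = re.findall('href="(.+?)"', ensure_text(FakeTag()))
print(links)
```

Printing type(value) at the point where the error occurs is often enough to spot the mismatch; the helper just makes the conversion explicit.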

Naver news crawler: no error

Code that raises the error: news = re.findall('href="(.+?)">(.+?)</a>',n)[0]

Code that works: news = re.findall('href="(.+?)">(.+?)</a>',str(n))[0]

import requests
import re
from bs4 import BeautifulSoup

url="http://news.naver.com/main/main.nhn?mode=LSD&mid=shm&sid1=105"
html=requests.get(url)
bs_html=BeautifulSoup(html.content,"html.parser")

news_list=bs_html.find_all("a",{"class":"cluster_text_headline"})

for n in news_list:
    news = re.findall('href="(.+?)">(.+?)</a>',str(n))[0]  # convert the Tag to str before matching
    title = news[1]
    link = news[0].replace("amp;","")  # Naver links contain "amp;", which breaks the URL, so strip it
    print(title)
    print(link)
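As an alternative to the str() plus regex approach above, the same title and link can be read straight from the Tag object. This is a different technique from the one in this post, shown as a sketch on made-up sample markup (sample_html and the example.com URLs are assumptions, not the real Naver page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
sample_html = """
<ul>
  <li><a class="cluster_text_headline" href="https://example.com/a?x=1&amp;y=2">First headline</a></li>
  <li><a class="cluster_text_headline" href="https://example.com/b">Second headline</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

results = []
for n in soup.find_all("a", {"class": "cluster_text_headline"}):
    title = n.get_text(strip=True)  # text inside the <a> element
    link = n.get("href")            # attribute value, entities already decoded
    results.append((title, link))
    print(title)
    print(link)
```

Because the parser decodes entities in attribute values, the link comes back with a plain & and the separate replace("amp;","") step is not needed in this variant.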

License: Attribution, NonCommercial, NoDerivatives
