Beautiful Soup: Getting Started with Web Crawling

Hi there! Today I'll walk through how to crawl data from a website with Beautiful Soup. The site I'm crawling is ISRI, which holds information on Information Systems models and their constructs. I'll organize that information into a DataFrame and save it as a CSV file.

 

https://isri.sciencesphere.org/index.php?o=constructs 👉🏻 This is the link I'll be crawling.

 

This page lists a variety of constructs. I'll follow the link for each construct in the list and extract its name, definition, theory/model, and reference. Then I'll add that data to a DataFrame.

 

# import libraries 

from bs4 import BeautifulSoup
import requests
import time
import datetime
import pandas as pd
import smtplib

 

 

I put the construct names into a list called con_list, like this.

con_list = [
    "A Priori Attitudes",
    "Accessibility",
    "Accuracy",
    "Actual Frequency of Use",
    "Actual System Use",
    "Actual Usage",
    "Adaptive Use Intention",
    "Adoption Decision",
    "Affect",
    "Agreeableness",
    "Anxiety",
    "Applications for Fun",
    "Applications for Personal Use",
    "Asset Specificity",
    "Assistance",
    "Attitude",
    "Attitude toward Getting Information",
    "Attitude toward using technology",
    "Attitude Towards the Behavior",
    "Attitude Towards Use",
    "Authorization",
    "Availability",
    "Avoidance of Personal Interaction",
    "Awareness",
    "Awareness of Local Contexts",
    "Behavior",
    "Behavioral Expectation",
    "Behavioral Intention",
    "Behavioral Intention for Continued Use",
    "Behavioral Intention to Use",
    "Cognitive Absorption",
    "Cognitive Absorption (Control)",
    "Cognitive Absorption (Curiosity)",
    "Cognitive Absorption (Focused Immersion)",
    "Cognitive Absorption (Heightened Enjoyment)",
    "Cognitive Absorption (Temporal Dissociation)",
    "Cognitive Trust in Competence",
    "Cognitive Trust in Integrity",
    "Cognizance of Alternative Technologies",
    "Collaboration Quality",
    "Collaborative Norms",
    "Comfort with Change",
    "Comfort with Changes",
    "Communication Effectiveness",
    "Communication Quality",
    "Compatibility",
    "Compatibility with Existing Practices",
    "Compatibility with Preferred Work Style",
    "Compatibility with Prior Experience",
    "Compatibility with Values",
    "Competitive Intensity",
    "Competitive Pressure",
    "Completeness",
    "Complexity",
    "Computer Anxiety",
    "Computer Playfulness",
    "Computer Self-efficacy",
    "Concurrency",
    "Confirmation",
    "Conscientiousness",
    "Consistency",
    "Consistency with User Knowledge",
    "Consumer Willingness",
    "Content Quality",
    "Continuance Behavior",
    "Continuance Intention",
    "Controllability",
    "Controllability over Getting Information",
    "Convenience",
    "Cost",
    "Costs",
    "Currency",
    "Customization",
    "Declining Cost",
    "Disconfirmation",
    "Disposition to Trust",
    "Documentation",
    "Download Delay",
    "Ease of Use",
    "E-business Know-How",
    "E-Business Usage",
    "E-business Value (Impact on Commerce)",
    "E-business Value (Impact on Coordination)",
    "E-business Value (Impact on Internal Efficiency)",
    "Efficacy",
    "Effort Expectancy",
    "Emotional Trust",
    "Encouragement by Others",
    "Engagement with the Technology",
    "Environment Context (Competition Intensity)",
    "Environment Context (Competitive Pressure)",
    "Environment Context (Regulatory Environment)",
    "Environmental Uncertainty",
    "Environmental Uncertainty (Dynamism)",
    "Environmental Uncertainty (Heterogeneity)",
    "Environmental Uncertainty (Hostility)",
    "Experience",
    "External Computing Support",
    "External Influence",
    "External Pressure",
    "External Task Environment",
    "External Training",
    "External Variables",
    "Extraversion",
    "Facilitating Conditions",
    "Facilitating Conditions (Resources)",
    "Facilitating Conditions (Technology)",
    "Familiarity",
    "Familiarity with Communication Partners",
    "Family, Relatives, Friends, and Peer Influence",
    "Fear of Technological Advances",
    "Flexibility",
    "Format",
    "Frequency Imitation",
    "Friends and Family Influences",
    "Future Obligation",
    "Getting Information",
    "Getting Information Habit",
    "Getting Information Skills",
    "Governmental Influence",
    "Group Valence",
    "Group’s Perceptions About the Complexity of the Technology",
    "Group’s Perceptions About the Task-Technology Fit",
    "Group’s Strength of Adoption of the Technology",
    "Groupware Use",
    "Habit",
    "Hardware Quality",
    "Hedonic Motivation",
    "Hedonic Outcomes",
    "Image",
    "Immediacy",
    "Impact of Operational IS Use",
    "Impact of Strategic IS Use",
    "Impact of Tactical IS Use",
    "Impact on Downstream Sales",
    "Impact on Internal Operations",
    "Impact on Marketing and Sales",
    "Impact on Procurement",
    "Impact on Upstream Coordination",
    "Individual Adaptation Behaviors",
    "Individual Characteristics",
    "Individual Impact",
    "Individual Performance Impact (Performance Impact of Computer Systems)",
    "Individual performance improvement after groupware adoption",
    "Individualism/Collectivism",
    "Information Credibility",
    "Information Quality",
    "Information Satisfaction",
    "Integration",
    "Intention to Continue",
    "Intention to Continue Using",
    "Intention to Participate",
    "Intention to Reuse",
    "Intention to Use (Use)",
    "Intention to Use Future Features",
    "Intentions",
    "Intentions to Adopt",
    "Intentions to Get Information",
    "Internal Computing Support",
    "Internal Self-efficacy",
    "Internal Training",
    "Internet Penetration",
    "Internet Self-efficacy",
    "Internet Skills",
    "Interpersonal Influence",
    "Intra-group Conflict",
    "IT Infrastructure",
    "Job Relevance",
    "Job Satisfaction",
    "Knowledge of Search Domain",
    "Knowledge-Intensity",
    "Learning Goal Orientation",
    "Management Profile",
    "Management Support",
    "Managerial Obstacles",
    "Masculinity/Femininity",
    "M-Business Impact on Firm Performance",
    "M-Business Usage",
    "Media Fit (Information Exchange)",
    "Media Fit (Solve Problems)",
    "Mobile Environment",
    "Mobility",
    "Monetary Resources",
    "Net Benefits",
    "Network Externality",
    "Network Externality (Use of Complementary Products)",
    "Neuroticism",
    "Normative Beliefs",
    "Normative Influences",
    "Objective Usability",
    "Observability",
    "Openness to experience",
    "Organization",
    "Organization Context (Financial Resources)",
    "Organization Context (Firm Size)",
    "Organization Context (Global Scope)",
    "Organization Size",
    "Organizational Impact",
    "Organizational Support",
    "Others' Use",
    "Outcome Expectations (Performance)",
    "Outcome Expectations (Personal)",
    "Outcome Imitation",
    "Output Quality",
    "Partner Pressure",
    "Partner Readiness",
    "Past Experience – Getting Information",
    "Past Experience – Purchasing",
    "Peer Influence",
    "Perceived Behavioral Control",
    "Perceived Behavioral Control over Getting Information",
    "Perceived Behavioral Control over Purchasing",
    "Perceived Benefits",
    "Perceived Complexity",
    "Perceived Credibility",
    "Perceived Critical Mass",
    "Perceived Diagnosticity",
    "Perceived Ease of Getting Information",
    "Perceived Ease of Purchasing",
    "Perceived Ease of Use",
    "Perceived Effectiveness",
    "Perceived Efficiency",
    "Perceived Enjoyment",
    "Perceived Financial Resources",
    "Perceived Frequency of Use",
    "Perceived Individual Benefits",
    "Perceived Information Protection",
    "Perceived Innovativeness",
    "Perceived Long-term Usefulness",
    "Perceived Near-term Usefulness",
    "Perceived Network Externalities",
    "Perceived Organizational Benefits",
    "Perceived Output Quality",
    "Perceived Performance",
    "Perceived Personalization",
    "Perceived Playfulness",
    "Perceived Purchasing Usefulness",
    "Perceived Resources",
    "Perceived Service Cost",
    "Perceived Technology Control",
    "Perceived Usefulness",
    "Perceived Usefulness (Adoption)",
    "Perceived Usefulness (Post-adoption)",
    "Perceived Usefulness (Productivity)",
    "Perceived Usefulness (Resource Advantage)",
    "Perceived Usefulness of Getting Information",
    "Perceived Value",
    "Perceived Voluntariness of Use",
    "Perceptions of External Control",
    "Perceptions of Internal Control",
    "Performance Expectancy",
    "Performance Impacts",
    "Personal Innovativeness",
    "Personal Innovativeness In Information Technology",
    "Personal Network Exposure",
    "Plan Quality",
    "Power Distance",
    "Predicted Usage",
    "Price Value",
    "Prior Computer Experience",
    "Prior experience",
    "Prior Use",
    "Privacy",
    "Process Quality",
    "Process Standardization",
    "Product Involvement",
    "Product Value",
    "Psychological Ownership of Information Technology",
    "Purchasing",
    "Purchasing Attitude",
    "Purchasing Controllability",
    "Purchasing Habit",
    "Purchasing Intentions",
    "Purchasing Self-Efficacy",
    "Purchasing Skills",
    "Purchasing Subjective Norm",
    "Relative Advantage",
    "Relevance",
    "Reliability",
    "Replacement Versus Disenchantment Discontinuance",
    "Result Demonstrability",
    "Risk Awareness",
    "Satisfaction",
    "Satisfaction (Process Satisfaction)",
    "Satisfaction (Solution Satisfaction)",
    "Satisfaction with IS (Accessibility)",
    "Satisfaction with IS (Compatibility)",
    "Satisfaction with IS (Confusion)",
    "Satisfaction with IS (Ease of Use of Hardware and Software)",
    "Satisfaction with IS (Flexibility)",
    "Satisfaction with IS (Locatability)",
    "Satisfaction with IS (Reliability)",
    "Satisfaction with IS (Security)",
    "Satisfaction with IS (Service Quality)",
    "Satisfaction with IS (Timeliness)",
    "Security Risk",
    "Self-efficacy",
    "Self-reported Impact",
    "Self-reported Knowledge",
    "Self-reported Skills",
    "Self-reported Usage",
    "Self-reported Value",
    "Self-sufficiency",
    "Social Influence",
    "Social Influence (Bandwagon Effect)",
    "Social Norms",
    "Socioeconomic Status",
    "Subjective Norm",
    "Subjective Norm for Purchase",
    "Subjective Norm for Usage",
    "Support",
    "System Characteristics",
    "System Fit",
    "System Performance",
    "System Security",
    "System Security and Privacy",
    "Task Compatibility",
    "Task-Technology Fit",
    "Technological Readiness",
    "Technological Sophistication",
    "Technology",
    "Technology Involvement",
    "Technology Complexity",
    "Technology Characteristics",
    "Technology Characteristics (Adaptation)",
    "Technology Characteristics (Control)",
    "Technology Characteristics (Accessibility)",
    "Technology Characteristics (Immediacy)",
    "Technology Characteristics (Location and Context)",
    "Technology Characteristics (Privacy)",
    "Technology Characteristics (User Experience)",
    "Technology Characteristics (Value Proposition)",
    "Technology Characteristics (Work Environment)",
    "Technology Confidence",
    "Technology Fit",
    "Technology Knowledge",
    "Technology Trust",
    "Technology Use",
    "Technology Use in Purchasing",
    "Technology Use in Shopping",
    "Technology Usage",
    "Time Pressure",
    "Time Required for Use",
    "Tendency to Change",
    "Task Complexity",
    "Task-Fit",
    "Task-Technology Fit",
    "Task-Technology Fit (Information Systems)",
    "Task-Technology Fit (Marketing)",
    "Task-Technology Fit (Purchasing)",
    "Task-Technology Fit (Selling)",
    "Task-Technology Fit (Service)",
    "Trust",
    "Trust in Information",
    "Trust in IT",
    "Trust in Organization",
    "Trust in Technology",
    "Use Behavior",
    "Usefulness",
    "User Engagement",
    "User Involvement",
    "User Interaction",
    "User Motivation",
    "User Preference",
    "User Self-Efficacy",
    "User Satisfaction",
    "User Skills",
    "User Trust",
    "Value",
    "Value Proposition",
    "Values and Goals",
    "Work Style",
    "Workplace"
]

 

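The code below is a function that crawls a given URL, pulls out the construct information, and stores it in a DataFrame.

  1. It takes a url as input.
  2. It fetches the HTML from that URL and looks for the construct information.
  3. It extracts the construct's name, definition, theory/model, and reference.
  4. It adds this information to a DataFrame and returns it.

<h3> is the HTML heading tag, and the function uses it to get each construct's name.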

 


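The important part here is to look closely at which HTML tags hold the information you want. After checking the HTML structure, pick an approach that fits the data you're after:
  • Understand the structure and hierarchy of the HTML.
  • Identify the tags that contain the data you want (div, h3, p, and so on).
  • Select the data by tag attributes (class, id) or by text content.
  • Clean up the data and return it in the shape you need.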
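
Before the full function, here is a toy illustration of those points on a hard-coded snippet, so nothing is downloaded. The markup in SAMPLE_HTML is made up for this example and only loosely mirrors what the function below expects; the real ISRI page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; not copied from the ISRI site.
SAMPLE_HTML = """
<div class="tabs__content active">
  <h3>Construct: Accuracy</h3>
  <p><i>Definition:</i> A made-up definition.</p>
</div>
<div class="tabs__content">
  <h3>Hidden tab</h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select by tag and class: only the div carrying both classes matches.
active_divs = soup.select("div.tabs__content.active")

# Select by text content: the first h3 whose text contains "Construct:".
h3_tag = soup.find("h3", string=lambda s: s and "Construct:" in s)
name = h3_tag.get_text(strip=True).replace("Construct:", "").strip()

# Select by exact text, then read the text node that follows the <i> label.
label = soup.find("i", string="Definition:")
definition = label.next_sibling.strip()

print(len(active_divs), name, definition)
```

Trying selectors on a small string like this makes it easier to see which one is failing when the real page returns nothing.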
import requests
from bs4 import BeautifulSoup
import pandas as pd

def put_to_df(url):
    # Fetch the HTML for the given URL
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Get every construct block on the page
    constructs = soup.select('div.tabs__content.active')  # div tags that carry the 'active' class

    # Empty list that collects the rows for the DataFrame
    rows = []

    # Process each construct block
    for construct in constructs:
        # Construct name (heading): find the h3 tag containing "Construct:"
        # (string= replaces the deprecated text= argument)
        h3_tag = soup.find('h3', string=lambda t: t and "Construct:" in t)
        if h3_tag:
            # Strip the "Construct:" label and surrounding whitespace
            construct_name = h3_tag.get_text(strip=True).replace("Construct:", "").strip()
        else:
            construct_name = 'N/A'  # default when the heading is missing

        # Definition: there may be several p tags
        paragraphs = construct.find_all('p')
        for paragraph in paragraphs:
            # Look for the "Definition:" label
            definition = paragraph.find('i', string="Definition:")
            if definition:
                # Take the paragraph text without the "Definition:" label
                definition_text = paragraph.get_text(strip=True).replace("Definition:", "").strip()

                # Look for the "Theory/model:" label
                theory_model = paragraph.find('i', string="Theory/model:")
                if theory_model and theory_model.next_sibling:
                    # Take the text right after "Theory/model:"
                    theory_model_text = theory_model.next_sibling.strip()
                else:
                    theory_model_text = 'N/A'  # default when the data is missing

                # Look for the "Reference:" label
                reference = paragraph.find('i', string="Reference:")
                # Take the text of the link after "Reference:", if there is one
                link = reference.find_next('a') if reference else None
                reference_text = link.get_text(strip=True) if link else 'N/A'

                # Add the row to the list
                rows.append({
                    'Construct Name': construct_name,
                    'Definition': definition_text,
                    'Theory/model': theory_model_text,
                    'Reference': reference_text
                })

    # Convert the collected rows to a DataFrame
    new_df = pd.DataFrame(rows)
    return new_df

 

URLs can't contain spaces, so each space has to be replaced with %20. For every construct name I used con.replace(" ", "%20") to swap the spaces for %20.

# Build the key-value pairs
con_dict = {}
for con in con_list:
    key = con.replace(" ", "%20")  # replace spaces with '%20'
    con_dict[con] = key

# Print the generated key-value pairs
for con, key in con_dict.items():
    print(f"{con}: {key}")  # construct name and its URL-ready key
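
Some names in con_list also contain characters such as "/", parentheses, or apostrophes, which replace(" ", "%20") leaves untouched. As an alternative sketch, the standard library's urllib.parse.quote percent-encodes those as well. The helper name build_construct_url is my own, and I haven't confirmed how the ISRI site treats the encoded forms, so take this as an option to try rather than a required fix.

```python
from urllib.parse import quote

BASE_URL = "https://isri.sciencesphere.org/index.php?o=construct&c="

def build_construct_url(name):
    """Return the construct URL with the name percent-encoded (hypothetical helper)."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return BASE_URL + quote(name, safe="")

print(build_construct_url("Actual System Use"))
print(build_construct_url("Individualism/Collectivism"))
```

For names made only of letters and spaces, this produces the same result as the replace() approach above.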

 

After creating a new, empty DataFrame, I build a URL from each key in con_dict, pass it to the put_to_df function, and append the result to the DataFrame.

final_df = pd.DataFrame()
for con, key in con_dict.items():
    temp_df = put_to_df("https://isri.sciencesphere.org/index.php?o=construct&c="+key)
    final_df = pd.concat([final_df, temp_df], ignore_index=True)
final_df.head(3)

final_df
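
The same loop can also be written so that the per-page DataFrames are collected in a list and concatenated once at the end, with a short pause between requests. This is only a sketch: collect_frames and fake_fetch are names I made up, and the one-second default is an assumed courtesy delay, not a limit published by the site.

```python
import time
import pandas as pd

def collect_frames(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL and combine the results into one DataFrame.

    fetch is any callable that returns a DataFrame, for example put_to_df.
    delay_seconds is an assumed pause between requests.
    """
    frames = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        frames.append(fetch(url))
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)

# Offline demo with a stand-in fetcher, so no request is sent.
def fake_fetch(url):
    return pd.DataFrame([{"Construct Name": url.split("c=")[-1]}])

demo_df = collect_frames(["x?c=Accuracy", "x?c=Anxiety"], fake_fetch, delay_seconds=0)
print(demo_df)
```

With the real crawler, the call would look something like collect_frames(list_of_urls, put_to_df), where list_of_urls is built from con_dict.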

 

The last step is saving the DataFrame as a CSV file.

final_df.to_csv("construct_definitions.csv")
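
One detail worth knowing: by default to_csv also writes the row index as an unnamed first column. If that column isn't wanted, index=False leaves it out. The small demo DataFrame below is made-up data, and it writes to an in-memory buffer instead of a file so it can be run anywhere.

```python
import io
import pandas as pd

# Made-up row, just to show the shape of the output.
demo = pd.DataFrame({
    "Construct Name": ["Accuracy"],
    "Definition": ["A made-up definition."],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False skips the row-index column
print(buffer.getvalue())

# Reading the text back gives the same two columns and nothing extra.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
```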

 
