Channel: Not sure why URL isn't being passed into a scraper of RSS feeds - Stack Overflow

Not sure why URL isn't being passed into a scraper of RSS feeds


I just want to scrape news articles from a set of RSS feeds.

import feedparser
import pandas as pd
from datetime import datetime

archive = pd.read_csv("national_news_scrape.csv")

# Your list of feeds
feeds = [
    {"type": "news", "title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
    {"type": "news", "title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
    {"type": "news", "title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
    {"type": "news", "title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
    {"type": "news", "title": "Metro UK", "url": "https://metro.co.uk/feed/"},
    {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
    {"type": "news", "title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
    {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
    {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
    {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
    {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
    {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
    #{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
    {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"},
]

# Create an empty DataFrame to store the news
news_df = pd.DataFrame(columns=['source', 'title', 'date', 'summary', "url"])

# For each feed, parse it and add the news to the DataFrame
for feed in feeds:
    print(f"Scraping: {feed['title']}")
    d = feedparser.parse(feed['url'])
    for entry in d.entries:
        # Some feeds do not have a 'summary' field, handle this case
        summary = entry.summary if hasattr(entry, 'summary') else ''
        url = entry.link
        # Add the news to the DataFrame
        news_df = news_df.append({
            'source': feed['title'],
            'title': entry.title,
            "url": url,
            'date': datetime(*entry.published_parsed[:6]),
            'summary': summary,
        }, ignore_index=True)

combined = pd.concat([news_df, archive]).drop_duplicates()

# Save the DataFrame to a CSV file
news_df.to_csv('national_news_scrape.csv', index=False)

Why doesn't it read the URL of an individual article?
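
One way to narrow this down is to check, entry by entry, whether a link is present at all before anything reaches the DataFrame. The following is only a minimal sketch under some assumptions: `entry_to_row` is a hypothetical helper (not part of feedparser or pandas), and the sample entries are made-up stand-ins built with `SimpleNamespace` so the snippet runs without network access. It relies only on attribute access, which is how the loop above already reads `entry.link` and `entry.title`.

```python
from datetime import datetime
from types import SimpleNamespace
import time


def entry_to_row(source, entry):
    """Build one output row from a feed entry, tolerating missing fields.

    `entry` is any object with attribute access, such as the items in
    feedparser's `d.entries`. This helper is hypothetical, for illustration.
    """
    published = getattr(entry, "published_parsed", None)
    return {
        "source": source,
        "title": getattr(entry, "title", ""),
        "url": getattr(entry, "link", ""),
        # published_parsed is a time.struct_time when present
        "date": datetime(*published[:6]) if published else None,
        "summary": getattr(entry, "summary", ""),
    }


# Stand-in entries (made-up data), one complete and one missing most fields.
sample_entries = [
    SimpleNamespace(title="With link", link="https://example.com/a",
                    published_parsed=time.gmtime(0), summary="text"),
    SimpleNamespace(title="No link or date"),
]

rows = [entry_to_row("Example feed", e) for e in sample_entries]
for row in rows:
    # An empty 'url' here means the entry itself carried no link.
    print(row["title"], "->", repr(row["url"]))
```

In the real loop, the same helper could be called as `entry_to_row(feed['title'], entry)` for each `entry` in `d.entries`. If the printed URLs look right at this stage, the problem is more likely in how the rows are combined or written out than in the feed parsing itself.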

