Class: WaybackArchiver::URLCollector
- Inherits: Object
- Defined in: lib/wayback_archiver/url_collector.rb
Overview
Retrieve URLs from different sources.
Class Method Summary
- .crawl(url, hosts: [], limit: WaybackArchiver.max_limit) ⇒ Array<String>
  Retrieve URLs by crawling.
- .sitemap(url) ⇒ Array<String>
  Retrieve URLs from Sitemap.
Class Method Details
.crawl(url, hosts: [], limit: WaybackArchiver.max_limit) ⇒ Array<String>
Retrieve URLs by crawling.
# File 'lib/wayback_archiver/url_collector.rb', line 37

def self.crawl(url, hosts: [], limit: WaybackArchiver.max_limit)
  urls = []
  start_at_url = Request.build_uri(url).to_s
  options = {
    robots: WaybackArchiver.respect_robots_txt,
    hosts: hosts,
    user_agent: WaybackArchiver.user_agent
  }
  options[:limit] = limit unless limit == -1

  Spidr.site(start_at_url, **options) do |spider|
    spider.every_page do |page|
      page_url = page.url.to_s
      urls << page_url
      WaybackArchiver.logger.debug "Found: #{page_url}"
      yield(page_url) if block_given?
    end
  end

  urls
end
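Note how the `limit` keyword is handled: `-1` means "no limit", so the `:limit` key is only passed to Spidr for any other value. A minimal sketch of that option-building step in isolation (fixed values stand in for the `WaybackArchiver.respect_robots_txt` and `WaybackArchiver.user_agent` configuration, and `build_crawl_options` is an illustrative helper, not part of the gem):

```ruby
# Sketch of the option hash crawl builds for Spidr: the :limit key
# is omitted entirely when limit is -1, added otherwise.
def build_crawl_options(hosts: [], limit: -1)
  options = {
    robots: true,                   # stand-in for WaybackArchiver.respect_robots_txt
    hosts: hosts,
    user_agent: "Wayback Archiver"  # stand-in for WaybackArchiver.user_agent
  }
  options[:limit] = limit unless limit == -1
  options
end

build_crawl_options(limit: -1).key?(:limit)  # => false
build_crawl_options(limit: 100)[:limit]      # => 100
```

Leaving `:limit` out (rather than passing a sentinel through) keeps Spidr's own "unlimited" default in effect.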
.sitemap(url) ⇒ Array<String>
Retrieve URLs from Sitemap.
# File 'lib/wayback_archiver/url_collector.rb', line 15

def self.sitemap(url)
  Sitemapper.urls(url: Request.build_uri(url))
end
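`Sitemapper.urls` handles fetching and parsing; conceptually the parsing step boils down to extracting every `<loc>` entry from the sitemap XML. A rough, self-contained sketch of that extraction using Ruby's stdlib REXML (the sitemap XML namespace is omitted for brevity, and `sitemap_locs` is an illustrative helper, not part of the gem):

```ruby
require "rexml/document"

# Illustrative only: pull every <url><loc> value out of a sitemap
# document, roughly what happens after fetching /sitemap.xml.
def sitemap_locs(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.to_a("url/loc").map(&:text)
end

sample = <<~XML
  <urlset>
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

sitemap_locs(sample)
# => ["https://example.com/", "https://example.com/about"]
```

A real sitemap may also be an index of nested sitemaps (`<sitemapindex>`), which Sitemapper resolves recursively; this sketch covers only the flat `<urlset>` case.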