YaCy-Bugtracker

ID: 0000518
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2014-12-27 00:25
Last Update: 2016-03-02 01:22
Reporter: oyvinds
Assigned To: BuBu
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
ETA: none
Platform:
OS:
OS Version:
Product Version: YaCy 1.8
Target Version:
Fixed in Version:
Summary: 0000518: YaCy does not appear to respect Crawl-delay
Description: From testing on my own sites, it appears that YaCy still fetches pages a lot faster than 1 page per 10 seconds, even when robots.txt says "Crawl-delay: 10".
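
For reference, here is a minimal sketch of how a crawler could read the Crawl-delay value from a robots.txt file, using Python's standard `urllib.robotparser`. This is not YaCy's code. The helper name `crawl_delay_for`, the sample robots.txt content, and the user-agent token `yacybot` are assumptions made for illustration.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Sample robots.txt content matching the directive described in this report.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt: str, user_agent: str) -> Optional[float]:
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else None


if __name__ == "__main__":
    # "yacybot" is a hypothetical user-agent token used only for this example.
    print(crawl_delay_for(ROBOTS_TXT, "yacybot"))
```

A crawler that honors the directive would then wait at least that many seconds between two requests to the same host.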

YaCy really should respect Crawl-delay. I believe the crawler's failure to respect this, and the utterly bad way it behaves in general, is why some sites choose to deny YaCy in robots.txt, like ZH: http://www.zerohedge.com/robots.txt

I have also noticed that some sites just block YaCy by IP due to too many requests per minute. It is bad when your IP becomes blocked on sites you usually visit regularly (which is a reason one may want to crawl them).

/CrawlStartSite.html actually says "No more that two pages are loaded from the same host in one second (not more that 120 document per minute) to limit the load on the target server." and this is, in my opinion, a very bad design decision. If 10 spiders do this (and there are at least 10 spiders crawling my more popular sites at all times), then that's a whole lot of load on a dynamically generated site.
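
To put rough numbers on this, the sketch below compares the aggregate request rate of 10 crawlers at two pages per second with the same 10 crawlers honoring "Crawl-delay: 10". It also shows one possible per-host throttle that never goes below the site's requested delay. The names `requests_per_minute` and `HostThrottle`, and the 0.5 second default, are assumptions for illustration; this is not how YaCy is implemented.

```python
import time
from typing import Callable, Dict, Optional

# Assumed default: two pages per second, as in the text quoted above.
DEFAULT_DELAY = 0.5  # seconds between requests to one host


def requests_per_minute(delay_seconds: float, crawlers: int = 1) -> float:
    """Aggregate requests per minute when each crawler waits delay_seconds."""
    return crawlers * 60.0 / delay_seconds


class HostThrottle:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, default_delay: float = DEFAULT_DELAY,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._default_delay = default_delay
        self._clock = clock
        self._next_allowed: Dict[str, float] = {}

    def wait_time(self, host: str, crawl_delay: Optional[float] = None) -> float:
        """Reserve the next slot for host and return how long to wait for it."""
        # Use whichever is larger: the crawler's default or the site's request.
        delay = max(self._default_delay, crawl_delay or 0.0)
        now = self._clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + delay
        return start - now


if __name__ == "__main__":
    print(requests_per_minute(DEFAULT_DELAY, crawlers=10))  # 10 crawlers, 2 pages/s each
    print(requests_per_minute(10, crawlers=10))             # 10 crawlers, Crawl-delay: 10
```

Under these assumptions, 10 crawlers at the default rate send 1200 requests per minute to one host, against 60 per minute when each waits 10 seconds.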
Tags: No tags attached.
Attached Files

Relationships

Notes
(0001052)
Davide (reporter)
2015-05-23 01:58

IIRC, the access delay is cumulative for all crawlers.
About changing the user agent string, I've opened the wishlist ticket 0000579.
(0001223)
BuBu (developer)
2016-03-02 01:22

Fixed fetching of robots.txt for sites using a non-standard port (other than 80 / 443), in v1.83/9710.
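
As an illustration of the kind of problem this note describes, the sketch below derives the robots.txt location from a page URL while keeping any explicit port. It is not the actual YaCy patch. The function name `robots_url` and the example hosts are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host (and port) serving page_url."""
    parts = urlsplit(page_url)
    # netloc carries the host plus any explicit port (e.g. "example.org:8080"),
    # so a site on a non-standard port gets its own robots.txt looked up.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://example.org:8080/a/b?x=1"))
    print(robots_url("https://example.org/index.html"))
```

If the port were dropped when building this URL, the crawler would read the robots.txt of whatever runs on port 80 or 443 instead, or find none, and so could miss a Crawl-delay set by the site being crawled.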

Issue History
Date Modified     Username  Field                Change
2014-12-27 00:25  oyvinds   New Issue
2015-05-23 01:58  Davide    Note Added: 0001052
2016-03-02 01:22  BuBu      Note Added: 0001223
2016-03-02 01:22  BuBu      Status               new => resolved
2016-03-02 01:22  BuBu      Resolution           open => fixed
2016-03-02 01:22  BuBu      Assigned To          => BuBu

