YaCy-Bugtracker - YaCy
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000738||YaCy||[All Projects] General||public||2017-04-22 02:38||2019-07-28 08:38|
|Product Version||YaCy 1.9|
|Target Version||Fixed in Version|
|Summary||0000738: YaCy stuck in noindex,follow wasteland|
|Description||I've seen this happen many times: the crawler is about to crawl a set of forum, search-result, or archive pages (the kind of pages with little or duplicate content that usually should not be indexed by search engines). SEOs commonly meta-tag such pages as "noindex,follow" to let PageRank flow through the site while keeping such pages from spamming the Google index.|
YaCy takes the instruction (noindex,follow) as is, but has no way to deal with it properly. As a result, YaCy will (in certain situations) crawl such pages for hours without indexing anything.
|Steps To Reproduce|
|Additional Information||Same issue:|
In my case it's a scheduled job. It's limited to 100 pages per seed URL, but that limit doesn't seem to take effect for scheduled jobs.
A solution would be to disable "noindex,follow" crawling altogether. It makes little sense for YaCy anyway. A reasonable assumption is that important content is always linked from indexable pages and not hidden.
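To illustrate the proposal, here is a minimal sketch in Python. It is not YaCy code: the names `RobotsMetaParser`, `crawl_decision`, and the `treat_noindex_as_nofollow` switch are hypothetical and only show the idea of not following links from pages marked noindex.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalize them.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def crawl_decision(html, treat_noindex_as_nofollow=True):
    """Return (index_page, follow_links) for one fetched page.

    With treat_noindex_as_nofollow=True (the behavior proposed in this
    issue, an assumption of this sketch), a page that must not be indexed
    also contributes no outgoing links to the crawl queue.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives
    # "none" is shorthand for "noindex,nofollow".
    index_page = not ({"noindex", "none"} & directives)
    follow_links = not ({"nofollow", "none"} & directives)
    if treat_noindex_as_nofollow and not index_page:
        follow_links = False
    return index_page, follow_links
```

For a page tagged "noindex,follow", `crawl_decision(page)` returns `(False, False)`, so the crawler would neither index it nor enqueue its links. With `treat_noindex_as_nofollow=False` it returns `(False, True)`, which corresponds to the behavior described above.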
|Tags||No tags attached.|
|2017-04-22 02:38||shni||New Issue|
|There are no notes attached to this issue.|