YaCy-Bugtracker - YaCy
View Issue Details
0000738 | YaCy | [All Projects] General | public | 2017-04-22 02:38 | 2019-07-28 08:38
shni 
 
normal | major | sometimes
new | open
none 
YaCy 1.9 
 
0000738: YaCy stuck in noindex,follow wasteland
I have seen this happen many times: the crawler starts crawling a set of forum, search-result, or archive pages (the kind of pages with little or duplicate content that search engines usually should not index). SEOs commonly meta-tag such pages as "noindex,follow" so that PageRank flows through the site while the pages themselves stay out of the Google index.

YaCy takes the instruction (noindex,follow) literally but evidently has no way of handling it properly. As a result, YaCy will (in certain situations) crawl such pages for hours without indexing anything.
Same issue:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5061&p=29327#p29319
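
To make the directive concrete, here is a minimal sketch of how a crawler might read a robots meta tag and split it into an "index" and a "follow" decision. It is written in Python with the standard library only and is not YaCy's actual code; the names RobotsMetaParser, robots_directives and classify are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def classify(html):
    """Return (may_index, may_follow) for a page; both default to True."""
    directives = robots_directives(html)
    may_index = "noindex" not in directives and "none" not in directives
    may_follow = "nofollow" not in directives and "none" not in directives
    return may_index, may_follow
```

A page tagged "noindex,follow" yields (False, True): the crawler has to fetch it and queue its links, but nothing from the page itself ends up in the index. That is the state described above.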

In my case it is a scheduled job. It is limited to 100 pages per seed URL, but that limit does not seem to work for scheduled jobs.

A solution would be to disable "noindex,follow" crawling altogether. It makes little sense for YaCy anyway. A reasonable assumption is that important content is always linked from indexable pages and not hidden.
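
The effect of the proposed change can be sketched with a toy crawl over an in-memory link graph. This is only an illustration under stated assumptions, not YaCy's crawler: the functions crawl and demo_site, the follow_noindex switch, and the example URLs are all hypothetical. Each page is modelled as a (noindex, links) pair, and the crawl is a plain breadth-first traversal with a page budget.

```python
from collections import deque


def crawl(site, seed, max_pages, follow_noindex):
    """Breadth-first crawl of a toy site.

    site maps URL -> (noindex, links). Returns (fetched, indexed) URL lists.
    With follow_noindex=False, links on noindex pages are not queued.
    """
    fetched, indexed = [], []
    queue = deque([seed])
    seen = {seed}
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        noindex, links = site[url]
        fetched.append(url)  # the page must be fetched to see its meta tag
        if not noindex:
            indexed.append(url)
        elif not follow_noindex:
            continue  # proposed policy: treat noindex,follow as a dead end
        for link in links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, indexed


def demo_site():
    """Hypothetical site: two indexable pages plus a chain of noindex search pages."""
    site = {"/": (False, ["/article", "/search?p=1"]), "/article": (False, [])}
    for n in range(1, 6):
        next_links = [f"/search?p={n + 1}"] if n < 5 else []
        site[f"/search?p={n}"] = (True, next_links)
    return site
```

On this toy site, crawl(demo_site(), "/", 100, follow_noindex=True) fetches all seven pages but indexes only two, while follow_noindex=False fetches three pages and indexes the same two. The trade-off is the one named above: the policy only loses content if an indexable page is reachable solely through noindex pages.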
No tags attached.
Issue History
2017-04-22 02:38 | shni | New Issue

There are no notes attached to this issue.