YaCy-Bugtracker

View Issue Details
ID: 0000091
Project: YaCy
Category: Wishlist - Wunschliste
View Status: public
Date Submitted: 2011-12-05 19:26
Last Update: 2016-10-10 02:00
Reporter: kilian
Assigned To: administrator
Priority: normal
Severity: minor
Reproducibility: have not tried
Status: resolved
Resolution: fixed
ETA: none
Platform:
OS:
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0000091: Random Crawl Start
Description: What do you think about an option to start a crawl from an already indexed URL? Sometimes my crawler is idle and I don't want to enter a link manually. This could also improve the diversity of the links, as users tend to start crawls with popular URIs and a random crawl could find unindexed parts of the web.
Tags: No tags attached.
Attached Files

- Relationships

- Notes
(0000164)
LA_FORGE (reporter)
2011-12-05 22:28

I appreciate this kind of feature very much. Should we call it "Random Crawl"? Chaotic "Criss-cross crawling" around the entire web would be a great feature in YaCy!!
(0000200)
Orbiter (manager)
2011-12-18 18:58

This is not a bad idea in general, but it does not work: how do we get a 'random URL'? You cannot generate URLs like numbers. If you want URLs that have not been crawled yet, just crawl the URLs provided by remote crawl lists; there are always enough available.

If you don't see how the remote crawling feature is actually what you want, please tell us how to produce a 'random URL'.
(0000202)
kilian (reporter)
2011-12-18 19:46

How does remote crawling work exactly? Does every peer that executes the remote crawl start off with the crawl start URL entered by the crawl starter? Or is the task divided into pieces somehow (this makes more sense to me; index distribution over DHT is much faster than crawling the same page two or three times on different peers, isn't it)? Or does the crawl starter delegate each single URL?
That's important because it influences how far into the web YaCy will go.

With my proposed feature, YaCy would just pick a single URL from the index (if that is technically possible, given that everything is stored as RWIs), start a crawl with a defined depth, then pick the next URL from the index, and so on. I do not know how good or bad this is.
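
A rough sketch of how such a loop could be driven from the outside (this is not a confirmed YaCy API: the /yacysearch.rss endpoint and the Crawler_p.html parameters crawlingMode=url, crawlingURL and crawlingDepth are assumptions taken from the search and crawl start forms and may differ between versions):

#!/usr/bin/perl
# Sketch only: take a URL that is already in the local index and start a small,
# depth-limited crawl from it. All endpoints and parameter names are assumptions.
use strict;
use LWP::UserAgent;
use URI::Escape;

my $peer = 'http://127.0.0.1:8090';
my $ua = LWP::UserAgent->new;

# Fetch some already indexed URLs via the search RSS interface (assumed endpoint);
# a real implementation would need a way to draw a truly random entry from the RWI index.
my $res = $ua->get($peer.'/yacysearch.rss?query=the&maximumRecords=50');
die $res->status_line."\n" unless $res->is_success;
my @urls = ($res->content =~ m{<link>(http[^<]+)</link>}g);
die "no indexed URLs found\n" unless @urls;

# Pick one at random and use it as the start URL of a depth-limited crawl.
my $start = $urls[int(rand(@urls))];
my $crawl = $peer.'/Crawler_p.html?crawlingMode=url'
    .'&crawlingURL='.uri_escape($start)
    .'&crawlingDepth=2&mustmatch=.*&mustnotmatch='
    .'&indexText=on&indexMedia=on&crawlingstart=Start';
print "Starting crawl from ".$start."\n";
print $ua->get($crawl)->status_line, "\n";

This only picks from pages the peer already knows about, so it widens crawling around indexed pages rather than producing truly random URLs, which is the limitation Orbiter points out above.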
(0000203)
LA_FORGE (reporter)
2011-12-18 23:24

Or, as an additional feature, change the way YaCy processes the already filled crawler queue: a more random method of crawling with "chaotic" mixing of the URLs.
(0000204)
kilian (reporter)
2011-12-19 10:03

I think LA_FORGE's idea won't change anything: everything in the crawler queue is crawled sooner or later, so randomizing the order does not change the result. Maybe speed optimizations are possible, I don't know.
(0000222)
gack (reporter)
2011-12-29 14:01

My way:
Get random pages from Wikipedia,
put the links within the random page into a file,
repeat until the file is big enough,
start a crawl with this file,
repeat this forever ...

This is my quick and dirty Perl code:

#!/usr/bin/perl
# Feed the local YaCy crawler with links harvested from random Wikipedia pages.
# Prerequisites: LWP::UserAgent and HTML::LinkExtractor from CPAN, wget on the PATH
# (with German locale output), and a YaCy peer on 127.0.0.1:8090 with the German web interface.
use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtractor;

# Start URL that redirects to a random page (default: German Wikipedia's "random page").
my $randomurl = shift || 'http://de.wikipedia.org/wiki/Spezial:Zuf%C3%A4llige_Seite';
my $loopsleep = 15;      # seconds to sleep between main loop iterations
my $packetsize = 1000;   # number of links to collect per generated link file

srand(time());

my $ua = LWP::UserAgent->new;
$ua->agent("feedcrawler2/0.1 ");
my $LX = HTML::LinkExtractor->new();

my %h_url = ();

sub formattime4filename {
    my $t = shift;
    my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime($t);
    my $str = sprintf("%.4d%.2d%.2d-%.2d%.2d%.2d", $year+1900, $mon+1, $mday, $hour, $min, $sec);
    return $str;
}

sub getPage {
    my $url = shift;
    my $page = "";
    print "GET: ".$url."\n";
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req);
    if ($res->is_success) {
        # print substr($res->content, 0, 1000)." ...\n";
        $page = $res->content;
    } else {
        print $res->status_line, "\n";
    }
    return $page;
}

# Resolve the "random page" URL to a concrete article URL. wget --spider is used only
# for its redirect trace: the "Platz:" line is the Location header as printed by a
# German-locale wget, and the second whitespace/bracket-separated token is the target URL.
sub getRandomPageUrl {
    my $randomurl = shift;
    my @out = `wget --spider $randomurl 2>&1`;
    foreach my $line (@out) {
        if($line =~ /Platz:/) {
            my ($d1, $url, $d2) = split(/[ \[]/, $line);
            if(length($url)>0) {
                return $url;
            }
        }
    }
    return "";
}
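
# Alternative sketch (not part of the original script): a hypothetical drop-in
# replacement for getRandomPageUrl() that resolves the random-page redirect with
# LWP instead of parsing wget's German "Platz:" output, so it does not depend on
# a German-locale wget. After LWP has followed the redirects, $res->request->uri
# is the final article URL.
sub getRandomPageUrlLWP {
    my $randomurl = shift;
    my $res = $ua->request(HTTP::Request->new(GET => $randomurl));
    return "" unless $res->is_success;
    return $res->request->uri->as_string;
}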

# Scrape the size of the local crawler queue from the (German) web interface, which
# reports e.g. "Es befinden sich 217 Einträge in dem lokalen Crawler-Puffer"
# ("There are 217 entries in the local crawler buffer"). Returns -1 on error.
sub getLocalQueueSize {
    my $url = "http://127.0.0.1:8090/IndexCreateWWWLocalQueue_p.html?limit=5";
    my $cnt = -1;
    my $uri = URI->new($url);
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req);
    if ($res->is_success) {
        # print $res->content;
        my $page = $res->content;
        my(@lines) = split(/[\r\n]+/, $page);
        # print $page;
        foreach my $line (@lines) {
            if($line =~ /Es befinden sich / ) {
                my($x1,$x2) = split(/Es befinden sich /, $line);
                my($x2a,$x2b) = split(/<\/strong>/,$x2);
                if(length($x2a)>0) {
                    $x2a =~ s/\.//g;
                    $cnt = $x2a;
                    return $cnt;
                }
            }
            if($line =~ /Der lokale Crawler-Puffer ist leer/) {
                $cnt = 0;
                return $cnt;
            }
        }
    } else {
        print $res->status_line, "\n";
        $cnt = -1;
    }
    return $cnt;
}


# Trigger YaCy's "delete terminated crawl profiles" action on the profile editor page
# and return the number of crawl profiles still listed afterwards.
# ("Beendete Crawls löschen" is the German label of the delete button being posted.)
sub deleteTerminatedProfiles {
    my $logfile = "/tmp/deleteTerminatedProfiles.$$.log";
    my $outfile = "/tmp/deleteTerminatedProfiles.$$.out";
    my @out = `wget -o $logfile -O $outfile 'http://127.0.0.1:8090/CrawlProfileEditor_p.xml' --post-data "deleteTerminatedProfiles=Beendete Crawls löschen"`;
    foreach my $line (@out) {
        $line =~ s/[\r\n]+//;
        # printf("%s\n", $line);
    }
    my $remcnt=0;
        if(open(IN, "<$outfile")) {
            while(<IN>) {
                if(/<starturl>/) {
                    $remcnt++;
                }
            }
            close(IN);
            unlink($outfile);
        }
    return $remcnt;
}

# Submit the generated link file to YaCy's crawl start servlet (Crawler_p.html).
# The parameters mirror the fields of the (German) crawl start form; crawlingMode=file
# together with crawlingFile makes YaCy read the start URLs from the given file.
sub startCrawler {
    my $file = shift;
    my $cnt = 0;
    my $url ='http://127.0.0.1:8090/Crawler_p.html?'
        .'crawlingDomMaxPages=3000000&'
        .'intention=&'
        .'range=subpath&'
        .'indexMedia=on&'
        .'storeHTCache=off&'
        .'recrawl=nodoubles&'
        .'sitemapURL=&'
        .'repeat_time=7&'
        .'crawlingIfOlderUnit=day&'
        .'cachePolicy=iffresh&'
        .'indexText=on&'
        .'crawlingMode=file&'
        .'ipMustmatch=.*&'
        .'crawlingQ=on&'
        .'crawlingFile='.$file.'&'
        .'mustnotmatch=&'
        .'bookmarkTitle=&'
        .'countryMustMatchSwitch=false&'
        .'crawlingstart=Neuen%20Crawl%20starten&'
        .'ipMustnotmatch=&mustmatch=.*&'
        .'crawlingIfOlderNumber=7&'
        .'repeat_unit=seldays&'
        .'crawlingDepth=0&'
        .'countryMustMatchList=';

    # print "GET: ".$url."\n";
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req);
    if ($res->is_success) {
        # print $res->content;
        my $page = $res->content;
        my(@lines) = split(/[\r\n]+/, $page);
        # print $page;
        foreach my $line (@lines) {
            if($line =~ /<tr class="TableCell/ ) {
                $cnt++;
            }
        }
        print "OK\n";
    } else {
        print $res->status_line, "\n";
    }
    return $cnt;
}

my $cnt = 0;

# Main loop: whenever the local crawler queue could be read and holds at most 2000 entries,
# collect $packetsize links into a temporary HTML link file, hand it to YaCy as a file crawl,
# then clean up terminated crawl profiles.
while(1) {
    my $queuesize = getLocalQueueSize();
    print "Local queue size is ".$queuesize."\n";
    if($queuesize >= 0 && $queuesize <= 2000) {
        $cnt++;
        my $tm = formattime4filename(time());
        my $outfile = "/tmp/feedcrawler2.".$tm.".html";
        print "Create ".$outfile."\n";
        if(open(OUT, ">$outfile")) {
            printf(OUT "<html>\n");
            printf(OUT "<head>\n");
            printf(OUT "</head>\n");
            printf(OUT "<body>\n");
            my $i = 0;
            while($i<$packetsize) {
                sleep(1);
                my $url = getRandomPageUrl($randomurl);
                if(length($url)>0 && ! exists $h_url{$url}) {
                    if($i<$packetsize) {
                        my $page = getPage($url);
                        # sleep(1);
                        if(length($page)>0) {
                            printf(OUT "<a href=\"%s\">%s</a>
\n", $url, $url);
                            $i++;
                            print "ADD[".$i."]: ".$url."\n";
                            $h_url{$url} = 0;
                            # Extract all links from the fetched page and keep only http(s)
                            # links not seen before and not on the skip list (Google Books,
                            # Wikipedia/Wikimedia, OPAC, toolserver, PDF, MP3).
                            $LX->parse(\$page);
                            #print Dumper($LX->links);
                            for my $Link( @{ $LX->links } ) {
                                my $x = $$Link{href};
                                if($x =~ /^http/ && ! exists $h_url{$x} &&
                                    !( $x =~ /\/books\.google\./ ||
                                       $x =~ /wikipedia/ ||
                                       $x =~ /wikimedia/ ||
                                       $x =~ /dispatch\.opac\./ ||
                                       $x =~ /toolserver/ ||
                                       $x =~ /\.pdf$/ ||
                                       $x =~ /\.mp3$/)
                                  ) {
                                        # Split "scheme://host/path" on "/" and, if the link
                                        # has a path, also add the bare scheme://host root
                                        # as an extra start URL (once per host).
                                        my($p,$l,$d,$u) = split(/\//,$x);
                                        if(length($u)>0) {
                                            my $domurl = $p.'//'.$d;
                                            if(!exists $h_url{$domurl}) {
                                                $i++;
                                                printf(OUT "<a href=\"%s\">%s</a>
\n", $domurl, $domurl);
                                                print "ADD[".$i."]: ".$domurl."\n";
                                                $h_url{$domurl} = 1;
                                            }
                                        }
                                        $i++;
                                        printf(OUT "<a href=\"%s\">%s</a>
\n", $x, $x);
                                        print "ADD[".$i."]: ".$x."\n";
                                        $h_url{$x} = 2;
                                } else {
                                    if(length($x)>0) {
                                        #print "IGN: ".$x."\n";
                                    }
                                }
                                last if ($i >= $packetsize);
                            }
                        }
                    }
                }
            }
            printf(OUT "</body>\n");
            printf(OUT "</html>\n");
            close(OUT);
            print "Start crawler for ".$outfile." ...\n";
            startCrawler($outfile);
        }
        print "Delete terminated profiles ...\n";
        my $remcnt = deleteTerminatedProfiles();
        print $remcnt." profiles running.\n";
        print "\n";
    }
    
    print "sleep(".$loopsleep.") ...\n";
    sleep($loopsleep);
}
(0000240)
kilian (reporter)
2012-01-12 10:07

Not bad, although it is not a very user-friendly solution.
You could post it in the Wiki somewhere.
(0001329)
BuBu (developer)
2016-10-10 02:00

In addition to the remote crawl option, an AutoCrawl option is meanwhile available in v1.90 (Advanced Crawler -> AutoCrawl).

see https://github.com/yacy/yacy_search_server/pull/40

- Issue History
Date Modified Username Field Change
2011-12-05 19:26 kilian New Issue
2011-12-05 22:28 LA_FORGE Note Added: 0000164
2011-12-18 18:58 Orbiter Note Added: 0000200
2011-12-18 19:46 kilian Note Added: 0000202
2011-12-18 23:24 LA_FORGE Note Added: 0000203
2011-12-19 10:03 kilian Note Added: 0000204
2011-12-29 14:01 gack Note Added: 0000222
2012-01-12 10:07 kilian Note Added: 0000240
2016-10-10 02:00 BuBu Note Added: 0001329
2016-10-10 02:00 BuBu Status new => resolved
2016-10-10 02:00 BuBu Resolution open => fixed
2016-10-10 02:00 BuBu Assigned To => administrator

