We’ve Moved

January 15, 2010

OK, so that’s pretty obvious since this is at a new domain, with a new look.

101tec graciously hosted Bixo for the first 9 months, and we’d like to thank them for their support.

We’re in the process of migrating other services off of 101tec’s web site as well. The short list is:

  • Issue tracking is moving from Jira to the GitHub Bixo issues system. Existing issues will be ported over incrementally, based on priority.
  • Nexus (Maven repository) will be replaced by static web content served up by either GitHub or code.google.com.
  • TeamCity (build system) will be disabled until we have a pressing need for it.

The mailing list at Yahoo Groups and the source repository at GitHub remain unchanged.

If anything is broken, please send an email to the mailing list – thanks!

9 Comments
  1. Mark Sands
    March 3, 2010 1:00 am

    Looks awesome, guys. I just stumbled upon this project and I’m pretty interested in checking it out.

    Mark

    PS Please stick with GitHub! 🙂

    • March 3, 2010 6:43 am

      Hi Mark,

      Glad you think it looks interesting.

      It would be useful for you to shoot a quick email to the list at bixo-dev, maybe with some notes about how you’d be interested in using Bixo. That helps guide the development effort.

      Thanks,

      — Ken

      PS – Don’t worry, we won’t be moving from GitHub…

  2. Nathan
    March 30, 2010 2:14 pm

    Very interesting. Will be attempting to play with this.

  3. February 25, 2011 6:48 am

    We are in the process of developing a web data mining product for our own use.

    We found that Bixo is very suitable for it. We downloaded it and followed the instructions in the Getting Started section. With that, we are able to run bixo crawl and bixo status, and we observed that some files are created. But what is the next step? How can we retrieve the required information from the downloaded files? How can we customize or configure it? We are unable to find details on this. Are they available on this web site? Or do we have to work it out ourselves? Or do we have to go for commercial support? Can you please help with this?

    Thanks

    R.Natarajan

    • February 27, 2011 9:47 am

      Hi Natarajan,

      Hopefully your questions have been answered on the Bixo mailing list (http://groups.yahoo.com/group/bixo-dev/).

      If not, please continue to ask your questions, and members will do their best to answer. These questions are useful, as they help point out where documentation should be improved, so thanks!

      — Ken

  4. Laki
    November 25, 2016 2:54 am

    You mention on the homepage that you group URLs by IP address, which I think is problematic, as there are IP addresses that host millions of domains (e.g. 216.239.32.21).

    n.b. you can add https://www.semanticjuice.com/ to your vertical web crawlers page.

    • November 29, 2016 1:10 pm

      The reason to group by IP address is to address exactly the issue you mention, where millions of domains could be hosted by a single address. If you don’t group by IP address when applying throttling (required for polite crawling), you can easily overwhelm a site, which is something that’s very, very important to avoid. Read the IRLBot paper for an example of where they didn’t use the “group by IP” approach, and wound up in trouble.
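
As a rough illustration of the "group by IP" idea described above, here is a minimal Java sketch. It is not Bixo's actual code: the class `IpGroupedScheduler`, its methods, and the fixed per-group delay are all hypothetical, and the host-to-IP map is assumed to have been resolved beforehand. Hosts that share an IP address share one politeness delay, while different IP groups can be fetched in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Bixo's implementation): hosts that resolve to the
// same IP address are throttled together, so one server never sees more than
// one request per delay interval no matter how many domains it hosts.
class IpGroupedScheduler {

    // Groups host names by their already-resolved IP address.
    static Map<String, List<String>> groupByIp(Map<String, String> hostToIp) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : hostToIp.entrySet()) {
            groups.computeIfAbsent(e.getValue(), ip -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    // Returns the earliest fetch time (ms from start) for each host.
    // Hosts in the same IP group are spaced delayMs apart; each group
    // starts at time 0 because groups do not share a server.
    static Map<String, Long> schedule(Map<String, String> hostToIp, long delayMs) {
        Map<String, Long> startTimes = new LinkedHashMap<>();
        for (List<String> hosts : groupByIp(hostToIp).values()) {
            long t = 0;
            for (String host : hosts) {
                startTimes.put(host, t);
                t += delayMs;
            }
        }
        return startTimes;
    }
}
```

With a 30-second delay, two hosts on the same address are scheduled 30 seconds apart, while a host on a different address can start immediately.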

  5. Laki
    November 30, 2016 2:39 am

    But that means a single machine cannot visit even one page from each site hosted at that IP address in one month, not to mention keeping a ‘fresh’ index for most of those sites. I assume such IPs have a way of withstanding high crawl demands. Wouldn’t a better solution be to detect non-responsive IPs/hosts and postpone those URLs for a while?

    Also, robots.txt directives are relevant to hosts, not their common IP addresses.

    Awesome paper btw! And impressive speeds!!

    • January 4, 2017 7:13 am

      Hi Laki – you can override the “group by IP address” behavior by changing the call to GroupingKey.makeGroupingKey() in ProcessRobotsTask so that it uses the domain instead of the host IP address.
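
The effect of switching the grouping key can be sketched as follows. This does not reproduce Bixo's `GroupingKey.makeGroupingKey()` signature, which is not shown here; `GroupingKeySketch`, `Mode`, and `keyFor` are hypothetical names, the IP map is assumed to be resolved in advance, and the full host name stands in for the "domain" as a simplification.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: choose the key used to group URLs for throttling.
// This is not Bixo's GroupingKey class, only an illustration of the choice.
class GroupingKeySketch {

    enum Mode { BY_IP, BY_HOST }

    // Returns the grouping key for a URL. In BY_HOST mode the lower-cased
    // host name is the key; in BY_IP mode the pre-resolved IP address is
    // used, falling back to the host name when no address is known.
    static String keyFor(String url, Mode mode, Map<String, String> resolvedIps) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (mode == Mode.BY_HOST) {
            return host;
        }
        String ip = resolvedIps.get(host);
        return ip != null ? ip : host;
    }
}
```

Grouping by host gives each site its own throttle, which matches how robots.txt applies per host, at the cost of losing the protection for servers that host many domains on one address.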

      And yes, an alternative approach would be to detect response times and vary request rates accordingly – several crawlers do that, but you can still easily make a web site owner very unhappy by chewing up too much bandwidth on a server, regardless of the response performance.
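
One way to picture the response-time approach, together with the caveat about it, is a delay that scales with the last observed response time but never drops below a fixed floor. This is only a sketch under assumed parameters; `AdaptiveDelay`, `nextDelayMs`, and the factor and bounds are hypothetical and are not taken from Bixo or any specific crawler.

```java
// Hypothetical sketch of response-time-based throttling with a politeness
// floor: a fast server still gets at least minDelayMs between requests.
class AdaptiveDelay {

    // Scales the last response time by 'factor', then clamps the result
    // to the range [minDelayMs, maxDelayMs].
    static long nextDelayMs(long lastResponseMs, double factor, long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastResponseMs * factor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }
}
```

The floor is what keeps a quick server from being hit harder simply because it responds quickly.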

      For web mining, it’s critical to be a very, very well behaved citizen. You’re using a web site’s resources without any obvious benefit to the provider, so in Bixo we tried hard to avoid causing any problems. Thus the very conservative approach to crawling.
