Running Bixo in EC2
Bixo uses Cascading to define the execution of tasks. Cascading, in turn, is built on top of Hadoop, the leading open source MapReduce framework.
This means that Bixo will scale out to many servers, with little or no extra effort. You can run Bixo on your laptop, and you can run Bixo on a cluster of 100 servers, depending on your needs.
It also means that Bixo runs well in Amazon's Elastic Compute Cloud (EC2). With a few commands you can create a Bixo cluster, fetch and process the target web pages, save the results, and dispose of the cluster, all without buying, provisioning or maintaining any hardware.
This next section will provide step-by-step instructions for running Bixo in EC2.
WARNING: Information about configuring EC2 on the Hadoop and AWS sites may differ from what's described below. Please follow these steps, which leverage as much as possible the automation available via modified versions of Hadoop Bash scripts.
Getting an AWS account
The very first thing you need to do is to create an AWS account. This creates the various keys you need to interact with EC2, and sets up billing for the time that you use.
- Sign up for an AWS account, if you don’t already have one.
- Go to the Amazon EC2 Getting Started Guide page. Note that you don’t need to do everything described here, just the specific items listed below.
- Follow the steps in the Setting up an Account/Signing up for Amazon EC2 section. Once you've followed these steps, you'll have the X.509 certificate and private key, as well as your AWS account ID, your Access Key ID, and your Secret Key ID, all of which are listed on the "Access Identifiers" page. The AWS account ID is also called the "Account Number" on some AWS pages, and is a 12-digit number with the format xxxx-xxxx-xxxx. Also note that you must sign up for an EC2 account in addition to your AWS account.
- Wait until you’ve gotten confirmation from Amazon that EC2 access has been activated. This could take several days. To verify, log into your AWS Account, then click on the Account Activity link and verify that under the “Amazon Elastic Compute Cloud” item it says “View/Edit Service” and not “application in process”.
Configuring Bixo for EC2
- Follow the Getting Started instructions to download Bixo onto your hard disk.
- Create a new directory on your hard disk for your AWS key information. If you have a fork of Bixo, make sure this directory is NOT in your git directory, so that you don’t accidentally push your secret key information into GitHub.
- Inside this directory, copy the X.509 cert-<id>.pem and pk-<id>.pem files from the Getting an AWS Account procedure above.
- Inside this directory, create a file called accountid that contains your EC2 user id. This is the same as your AWS account ID, but without any dashes.
- Inside this directory, create a file called accesskey that contains your Access Key ID.
- Inside this directory, create a file called secretkey that contains your Secret Key ID.
- In your project directory (for the purpose of demonstration you can use the "examples" directory that comes with the distribution as your project) create a new directory called "ec2".
- In your <path to project>/ec2/ directory, create a file called .local.awskey-path. This should contain the full path to the AWS key directory that you’ve populated above.
- Copy the file <bixo dist>/ec2/setenv.sh.project-template to your "ec2" directory and rename it as setenv.sh (you may want to go over that file to see if you need to define any shell variables, e.g. setting up BIXO_HOME).
% cd <path to project>/ec2/
% . setenv.sh
% ec2-add-keypair <keypair name>
Use something short like "myprojectname" for the keypair name.
- Copy the output of everything between (and including) the "-----BEGIN RSA PRIVATE KEY-----" and "-----END RSA PRIVATE KEY-----" lines to create a file called id_rsa-<keypair name> in the AWS key directory.
- Set privileges on the id_rsa-<keypair name> file to be read-write for only the user:
% chmod go= id_rsa-<keypair name>
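Taken together, the key-directory steps above can be sketched as a short shell script. Every path, ID, and key value below is a placeholder rather than a real credential, and the directory name is an arbitrary choice:

```shell
# Sketch of the AWS key directory described above. All values are placeholders;
# substitute your real account ID, Access Key ID, and Secret Key ID.
KEYDIR="$HOME/bixo-aws-keys"        # keep this OUTSIDE any git checkout
mkdir -p "$KEYDIR"
cd "$KEYDIR"

# cert-<id>.pem and pk-<id>.pem would be copied here from the AWS download.
printf '%s' '123456789012'     > accountid   # AWS account ID, dashes removed
printf '%s' 'AKIAEXAMPLEKEYID' > accesskey   # Access Key ID
printf '%s' 'exampleSecretKey' > secretkey   # Secret Key ID

# The keypair private key saved from ec2-add-keypair must be owner-only:
touch id_rsa-myprojectname
chmod u=rw,go= id_rsa-myprojectname          # i.e. mode 600
ls -l
```

The `chmod u=rw,go=` form is equivalent to `chmod 600` and matches the command shown above.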
Congratulations, you are now set up to run Bixo in EC2.
Note: If you’re using Cygwin on Windows, make sure you have ssh installed in your Cygwin environment.
Setting up ElasticFox
ElasticFox is a free Firefox extension that lets you monitor your EC2 servers, find out information about them, configure public access, etc.
Follow these steps to install and configure ElasticFox:
- Launch Firefox (version 3.0 or later)
- Download and install the Elasticfox extension.
- If you get a dialog asking you what to do with the download file, select “Open with…” and choose Firefox.
- Select Tools > Elasticfox.
- The "Enter AWS credentials" dialog should be automatically displayed. Choose any Account Name, enter your AWS Access and Secret Access Keys, and click the "Add" button, followed by the "Close" button.
- Click the icon to the left of “Account IDs”, choose any Display Name, enter your AWS Account ID, and click the “Add” button.
- Make sure the "Regions" popup is set to "us-east-1".
Setting up FoxyProxy
FoxyProxy is a free Firefox extension that can use a local proxy to correctly handle internal EC2 URLs.
Follow these steps to install and configure FoxyProxy:
- Launch Firefox (version 3.0 or later)
- Download and install the FoxyProxy extension.
- Configure the FoxyProxy extension as follows (sorry, lots of steps here):
- Select Tools > Add-ons
- Select the “Extensions” window tab at the top.
- Select the “FoxyProxy” item from the list.
- Click on the Preferences button.
- Click the “Add New Proxy” button.
- Select “Manual Proxy Configuration” radio button.
- Enter “localhost” for the “Host or IP Address” field.
- Enter "6666" for the "Port" field.
- Select the “SOCKS proxy?” checkbox.
- Select the "SOCKS v5" radio button.
- Click on the “General” tab at the top of the dialog box.
- Enter "EC2" for the "Proxy Name" field.
- Click on the “URL Patterns” tab at the top of the dialog box.
- Click the “Add New Pattern” button.
- Enter “EC2-1″ for the “Pattern Name” field.
- Enter “*ec2*.amazonaws.com*” for the “URL pattern” field (not case sensitive)
- Select the “Whitelist” and “Wildcards” radio buttons.
- Click the “OK” button to dismiss the new URL pattern dialog box.
- Repeat the above five steps to add EC2-2, -3, and -4 patterns for “*ec2.internal*”, “*compute-1.amazonaws.com*”, and “*compute-1.internal*”.
- Click the “OK” button to dismiss the new proxy dialog box.
- Select "Use proxies based on their pre-defined patterns and priorities" from the Mode popup menu at the top of the FoxyProxy window.
- Close the FoxyProxy Options dialog box.
- Restart Firefox.
- At the bottom right corner of your browser, you should now have a new “FoxyProxy Patterns” label.
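As a rough sanity check, the four whitelist patterns above behave much like shell globs. This sketch (the URLs are made up) shows which addresses would be sent through the proxy; it assumes FoxyProxy's wildcard matching is approximately glob matching:

```shell
# Check a URL against the four FoxyProxy whitelist patterns using shell globbing.
matches_ec2_pattern() {
  case "$1" in
    *ec2*.amazonaws.com*|*ec2.internal*|*compute-1.amazonaws.com*|*compute-1.internal*)
      echo yes ;;
    *)
      echo no ;;
  esac
}

matches_ec2_pattern "http://ec2-174-129-10-20.compute-1.amazonaws.com:50030/"  # yes
matches_ec2_pattern "http://www.example.com/"                                  # no
```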
Launching a Bixo EC2 Cluster
You are now ready to launch a Bixo cluster in EC2.
% cd <path to project>/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster <cluster name> <number of slave servers> [<instance type> [<max spot price>]]
For example, "% hadoop-ec2 launch-cluster mybixocluster 2". Ignore the many "[Deprecated] Xalan : xxx" messages that annoyingly appear in the terminal window. If no <instance type> is specified, it defaults to DEFAULT_INSTANCE_TYPE (see the notes in <path to bixo dist>/ec2/setenv.sh and <path to bixo dist>/ec2/hadoop-aws/etc/hadoop-ec2-env.sh for more details). If <max spot price> is specified, the slave instances will take advantage of spot pricing (typically only a third of On Demand pricing). Spot-priced instances aren't launched immediately, however: requests for spot instances are made, and those requests are satisfied somewhat later (minutes or hours, depending on the number and type of instances).
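Since hadoop-ec2 only runs against a live AWS account, here is a hedged sketch that merely assembles the launch-cluster command line from its parts. The cluster name, slave count, instance type, and spot price are all illustrative values, not recommendations:

```shell
# Build (but do not run) a launch-cluster invocation. All values are examples.
CLUSTER=mybixocluster
SLAVES=2
INSTANCE_TYPE=m1.large    # optional; DEFAULT_INSTANCE_TYPE is used when omitted
MAX_SPOT_PRICE=0.10       # optional; USD/hour bid for spot-priced slaves

CMD="hadoop-ec2 launch-cluster $CLUSTER $SLAVES $INSTANCE_TYPE $MAX_SPOT_PRICE"
echo "$CMD"
```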
- Open a new Firefox browser window and select Tools > Elasticfox
- Wait for your <cluster name>-master and <cluster name> slave servers to show up in the Elasticfox list.
- Open a new terminal window
% cd <path to project>/ec2
% . setenv.sh
% hadoop-ec2 proxy <cluster name>
This will start up a local proxy, and output URLs to use for monitoring the cluster.
- Open a new browser window and paste the JobTracker URL output from the proxy. This will let you track the actual Hadoop jobs once Bixo starts running.
- Open a new browser window and paste the NameNode URL output from the proxy. This will let you view files in HDFS (Hadoop Distributed File System) that are generated by Bixo.
Running a Bixo job
Once your cluster is up and running, you can run an actual Bixo job on it. To run one of the examples that come with the Bixo distribution you need to build a Hadoop job jar first.
% cd <path to examples>
% . setenv.sh
% hadoop-ec2 push <cluster name> build/bixo-examples-job-<version>.jar
This may take a while, as the large jar has to be uploaded from your machine to the <cluster name>-master server.
% hadoop-ec2 screen <cluster name>
This logs you into the master server and creates a screen.
% hadoop jar bixo-examples-job-<version>.jar -domain yahoo.com -numloops 2 -outputdir output -agentname <your agent name>
The manifest file for the “bixo-examples” job jar has the DemoCrawlTool defined as the Main Class, so the above command will start a crawl of the yahoo.com domain, and do two loops. You can monitor the progress of the crawl via the browser window you opened with the JobTracker URL from the proxy. In addition, you can watch files being created in HDFS by browsing the HDFS file system in the browser window you opened with the NameNode URL from the proxy.
Use the ctrl-a ctrl-d key sequence to detach from the screen and return control of your terminal window to your local machine. The job will continue running, and you can re-attach to the same screen by re-running the “hadoop-ec2 screen <cluster>” command.
Finally, when your job is done, please DO NOT forget to terminate your cluster; otherwise you'll continue to incur Amazon EC2 charges. To terminate, run the "hadoop-ec2 terminate-cluster <cluster>" command. It will take anywhere from 30 seconds to a few minutes for the details of the cluster to be collected, at which point you'll be prompted to confirm termination with a command line prompt that says "Terminate all instances? [yes or no]: ". Enter "yes", and your cluster will be terminated.
SSHing into slaves – First use the hadoop-ec2 login command to log into the master, then:
% ssh <internal server name>
All servers in the cluster are configured for keyless SSH.
Copying files from EC2 servers – Use the scp tool, as in:
% scp $SSH_OPTS root@<public DNS name>:<path to file> <local path>
The $SSH_OPTS shell variable has been set up by the setenv.sh script with the values needed to access the EC2 servers.
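Because the real $SSH_OPTS value depends on your local setup, here is a hedged sketch of what the expanded scp command might look like. The key path, hostname, and file path are illustrative assumptions, not values produced by setenv.sh:

```shell
# Illustrative expansion of the scp command above. The real SSH_OPTS value is
# set by setenv.sh; this one is assumed purely for demonstration.
SSH_OPTS='-i /path/to/aws-keys/id_rsa-myprojectname -o StrictHostKeyChecking=no'
SCP_CMD="scp $SSH_OPTS root@ec2-174-129-10-20.compute-1.amazonaws.com:/mnt/output/part-00000 ."
echo "$SCP_CMD"
```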
Shutting down your cluster – A useful command to know is:
% hadoop-ec2 list
which lists your currently running clusters. Once you have finished your work, make sure that you shut down your cluster:
% hadoop-ec2 terminate-cluster <cluster name>