Introduction
============

``bg.crawler`` is a command-line frontend for feeding a tree of files (a
directory) into a Solr server for indexing.

Requirements
============

* Python 2.6 or Python 2.7 (no support for Python 3)
* ``curl``

Installation
============

* use ``easy_install bg.crawler`` - this should install a script
  ``solr-crawler`` inside the ``bin`` folder of your Python installation.
  You are strongly encouraged to use ``virtualenv`` to create an isolated
  Python environment.

Usage
=====

Command line options::

  usage: solr-crawler [-h] [--solr-url SOLR_URL]
                      [--render-base-url RENDER_BASE_URL]
                      [--max-depth MAX_DEPTH] [--commit-after COMMIT_AFTER]
                      [--tag TAG] [--clear-all] [--optimize]
                      [--guess-encoding] [--clear-tag SOLR_CLEAR_TAG]
                      [--verbose] [--no-type-check]
                      directory

  A command-line crawler for importing all files within a directory into Solr

  positional arguments:
    directory             Directory to be crawled

  optional arguments:
    -h, --help            show this help message and exit
    --solr-url SOLR_URL, -u SOLR_URL
                          Solr server URL
    --render-base-url RENDER_BASE_URL, -r RENDER_BASE_URL
                          Base URL for the server delivering the crawled
                          content
    --max-depth MAX_DEPTH, -d MAX_DEPTH
                          maximum folder depth
    --commit-after COMMIT_AFTER, -C COMMIT_AFTER
                          Solr commit after N documents
    --tag TAG, -t TAG     Solr import tag
    --clear-all, -c       Clear the Solr index before crawling
    --optimize, -O        Optimize the Solr index after import
    --guess-encoding, -g  Guess the encoding of the input data
    --clear-tag SOLR_CLEAR_TAG
                          Remove all items from the Solr index tagged with
                          the given tag
    --verbose, -v         Verbose logging
    --no-type-check, -n   Do not apply the internal extension filter while
                          crawling

  Have fun!

* ``--solr-url`` defines the URL of the Solr server
* ``--render-base-url`` specifies a URL prefix used to calculate the value
  of the ``renderurl`` field within Solr. The value of ``renderurl`` is the
  concatenation of the ``render-base-url`` value and the path of the
  crawled file relative to the crawler start directory. This option is
  useful for generating a link from ``renderurl`` when the crawled files
  are also served by a web server. For example, crawling ``/data/docs``
  with ``--render-base-url http://www.example.com/docs`` gives the file
  ``/data/docs/manual/intro.html`` a ``renderurl`` of
  ``http://www.example.com/docs/manual/intro.html``.
* ``--max-depth`` limits the crawler to the given folder depth
* ``--commit-after`` takes a numeric value N; the documents are then
  imported in batches, with a Solr commit operation after each batch of N
  documents instead of after each individual document
* ``--tag`` tags the imported document(s) with a string (this can be useful
  when importing different document sources into Solr while retaining the
  option to filter by tag at query time; see the query example below)
* ``--clear-all`` clears the complete Solr index before running the import
* ``--clear-tag`` removes all documents with the given tag before running
  the import
* ``--verbose`` enables extensive logging
* ``--no-type-check`` if set, no file-type filtering is applied and all
  file types are passed to Solr
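Putting these options together, a typical run might look like the following
sketch. The crawl directory, Solr URL, render base URL, and tag are
placeholders for illustration only; substitute the values for your own
environment::

  # crawl /data/docs, tagging the documents and committing
  # after every 100 documents
  bin/solr-crawler /data/docs \
      --solr-url http://localhost:8983/solr \
      --render-base-url http://www.example.com/docs \
      --tag docs \
      --commit-after 100 \
      --verbose

After such an import, the tagged documents can be selected at query time by
filtering on the ``tag`` field, e.g. via the standard Solr select handler
(again with placeholder host and tag)::

  curl 'http://localhost:8983/solr/select?q=*:*&fq=tag:docs'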
Internals
=========

* uses the ``python-magic`` module to determine the mimetype of the files
  to be imported
* currently deals with HTML and plain-text files
* HTML files are currently parsed internally and converted to plain text

Sourcecode
==========

https://github.com/zopyx/bg.crawler

Bug tracker
===========

https://github.com/zopyx/bg.crawler/issues

Solr setup
==========

You can use the buildout configuration from
https://raw.github.com/zopyx/bg.crawler/master/solr-3.4.cfg as an example
of how to set up a Solr instance for use with ``bg.crawler``.

It is important that the following field definitions are available within
your Solr instance::

  index =
      name:text type:text stored:true
      name:title type:text stored:true
      name:created type:date stored:true required:true
      name:modified type:date stored:true
      name:filesize type:integer stored:true
      name:mimetype type:string stored:true
      name:id type:string stored:true required:true
      name:relpath type:string stored:true
      name:fullpath type:string stored:true
      name:renderurl type:string stored:true
      name:tag type:string stored:true

After running ``buildout`` you can start the Solr instance using (``fg``
runs it in the foreground, ``start`` in the background)::

  bin/solr-instance fg|start

Licence
=======

``bg.crawler`` is published under the GNU General Public License v2 (GPL 2).

Author
======

| ZOPYX Ltd.
| Charlottenstr. 37/1
| D-72070 Tuebingen
| Germany
| info@zopyx.com
| www.zopyx.com