strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
Hi, I would like to take the HTML documents returned by a Solr search and merge the body text of each one into a single HTML document. What is a good strategy for this? I am crawling with Nutch and using Carrot2 for clustering. Fred
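
One way to approach the merge on the client side, sketched in Python using only the standard library. This is an illustration, not something proposed in the thread: the names BodyTextExtractor, body_text and combine are hypothetical, and the input is assumed to be a list of (title, html) pairs already fetched from the search results. Text inside script or style tags within the body is not filtered out here.

```python
from html import escape
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the body.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def body_text(html):
    """Return the body text of one HTML document as a single string."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def combine(documents):
    """Merge (title, html) pairs into one HTML document, one section each."""
    sections = [
        "<h2>%s</h2>\n<p>%s</p>" % (escape(title), escape(body_text(html)))
        for title, html in documents
    ]
    return "<html><body>\n%s\n</body></html>" % "\n".join(sections)
```

Calling combine() on the fetched result pages yields one document with a heading and a paragraph per source page.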

Re: strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
Can you say a bit more about this? I see Velocity and will download it and start playing around, but I am not quite sure I understand all the steps that you are suggesting. Fred On Thu, Sep 22, 2011 at 19:51, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Solr supports the Velocity

Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
This seems to be out of date. I am running Solr 3.4. * The file structure of apachehome/contrib is different and I don't see velocity anywhere underneath. * The page referenced below only talks about Solr 1.4 and 4.0? On Thu, Sep 22, 2011 at 19:51, Markus Jelsma markus.jel...@openindex.io wrote:

Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
at 11:57, Fred Zimmerman w...@nimblebooks.com wrote: This seems to be out of date. I am running Solr 3.4 * the file structure of apachehome/contrib is different and I don't see velocity anywhere underneath * the page referenced below only talks about Solr 1.4 and 4.0 ? On Thu, Sep 22, 2011

Re: strategy for post-processing answer set

2011-09-24 Thread Fred Zimmerman
erik.hatc...@gmail.com wrote: conf/velocity by default. See Solr's example configuration. Erik On Sep 23, 2011, at 12:37, Fred Zimmerman w...@nimblebooks.com wrote: ok, answered my own question, found velocity rw in solrconfig.xml. next question: where does velocity look for its

http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
This HTTP request works as desired (bringing back a CSV file): http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv; but the same URL submitted via wget produces the 500 error reproduced below. I want wget to download the CSV file. What's going on?

Re: http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
Got it. curl http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select/?indent=on&q=video&fl=name,id&wt=csv; works like a champ. On Tue, Oct 4, 2011 at 15:35, Fred Zimmerman w...@nimblebooks.com wrote: This http request works as desired (bringing back a csv file) http://zimzazsearch3-1
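
The snippet does not say why wget failed. One common cause with URLs like this, offered here only as a guess, is an unquoted & on the command line: the shell cuts the URL at the first &, so Solr never sees the later parameters. Building the request in a script avoids shell quoting altogether. A minimal Python sketch; solr_select_url and fetch_csv are hypothetical helper names, and the base URL in the usage note is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base, params):
    """Build a /select URL; urlencode escapes each value and joins with &."""
    return base.rstrip("/") + "/select?" + urlencode(params)


def fetch_csv(base, query, fields=None):
    """Run a query against Solr and return the CSV response as text."""
    params = [("q", query), ("wt", "csv")]
    if fields:
        # fl takes a comma-separated list of field names.
        params.append(("fl", ",".join(fields)))
    with urlopen(solr_select_url(base, params)) as response:
        return response.read().decode("utf-8")
```

For example, fetch_csv("http://localhost:8983/solr", "video", ["name", "id"]) would request the same kind of CSV output as the curl command above, assuming a Solr instance at that address.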

more like this

2011-10-05 Thread Fred Zimmerman
Hi, for my application, I would like to be able to create web queries (wget/curl) that get "more like this" (MLT) results for either a single arbitrarily specified URL or for the first x terms in a search query. I want to return the results to myself as a CSV file using wt=csv. How can I accomplish the MLT
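
A rough sketch of what such a request could look like, in Python. It rests on several assumptions that the thread does not confirm: that a MoreLikeThis request handler is registered at /mlt in solrconfig.xml, that the index has a field holding the page URL (called url here), and that the similarity fields named in mlt.fl are suitable for MLT in the schema. mlt_csv_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def mlt_csv_url(base, seed_query, similarity_fields, rows=10):
    """Build a URL asking for documents similar to the one matched by seed_query.

    Assumes a MoreLikeThis handler is mapped to /mlt (not a given in every setup).
    """
    params = [
        ("q", seed_query),                        # selects the seed document
        ("mlt.fl", ",".join(similarity_fields)),  # fields used to judge similarity
        ("rows", rows),                           # number of similar documents
        ("wt", "csv"),                            # return the results as CSV
    ]
    return base.rstrip("/") + "/mlt?" + urlencode(params)
```

To seed from a specific page, seed_query might be something like url:"http://example.com/page"; the resulting URL can then be passed to wget or curl inside quotes.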

how to add search terms to output of wt=csv?

2011-10-14 Thread Fred Zimmerman
Hi, I want to include the search query in the output of wt=csv (or a duplicate of it) so that the process that receives this output can do something with the search terms. How would I accomplish this? Fred
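
If the process that issues the query also receives the CSV, one option is to attach the search terms after the fact rather than asking Solr to echo them. A small Python sketch under that assumption; add_query_column is a hypothetical name, and the CSV is assumed to start with a header row.

```python
import csv
import io


def add_query_column(csv_text, query, column="query"):
    """Append a column holding the search terms to every row of a CSV response.

    Assumes the first row of csv_text is the header line.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        # The header row gets the column name, data rows get the query.
        writer.writerow(row + [column if index == 0 else query])
    return out.getvalue()
```

The downstream process can then read the query from the last column of any row.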

changing base URLs in indexes

2011-10-18 Thread Fred Zimmerman
Hi, I am getting ready to index a recent copy of Wikipedia's pages-articles dump. I have two servers, foo and bar. On foo.com/mediawiki I have a Mediawiki install serving up the pages. On bar.com/solr I have my solr install. I have the pages-articles.xml file from Wikipedia and the solr

dataimport indexing fails: where are my log files ? ;-)

2011-10-19 Thread Fred Zimmerman
Dumb question ... today I set up solr3.4/example, indexing to 8983 via post is working, so is search, solr/dataimport reports <str name="Total Rows Fetched">0</str> <str name="Total Documents Processed">0</str> <str name="Total Documents Skipped">0</str> <str name="Full Dump Started">2011-10-19 18:13:57</str> <str

where is solr data import handler looking for my file?

2011-10-19 Thread Fred Zimmerman
Solr dataimport is reporting file not found when it looks for foo.xml. Where is it looking for /data? Is this a URL off the apache2/htdocs on the server, or is it a URL within example/solr/...? <entity name="page" processor="XPathEntityProcessor" stream="true"

success with indexing Wikipedia - lessons learned

2011-10-21 Thread Fred Zimmerman
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/

Re: where is solr data import handler looking for my file?

2011-10-23 Thread Fred Zimmerman
to solve. Offhand, it looks as though you're trying to do something with DIH that it wasn't intended to do. But that's just a guess since the details of what you're trying to do are so sparse... Best Erick On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman zimzaz@gmail.com wrote: Solr

schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
Hi, it seems from my limited experience thus far that as new data types are added, schema.xml will tend to become bloated with many different field and fieldtype definitions. Is this a problem in real life, and if so, what strategies are used to address it? FredZ

Re: schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
So, basically, yes, it is a real problem and there is no designed solution? e.g. optional sub-schema files that can be turned off and on? On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher erik.hatc...@gmail.com wrote: On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote: it seems from my limited

Re: Is there a good web front end application / interface for solr

2011-10-25 Thread Fred Zimmerman
what about something that's a bit less discovery-oriented? for my particular application I am most concerned with bringing back a straightforward top ten answer set and having users look at it. I actually don't want to bother them with faceting, etc. at this juncture. Fred On Tue, Oct 25, 2011

missing core name in path

2011-10-26 Thread Fred Zimmerman
It is not a multi-core setup. The solr.xml has null value for cores. ? HTTP ERROR 404 Problem accessing /solr/admin/index.jsp. Reason: missing core name in path 2011-10-26 13:40:21.182:WARN::/solr/admin/ java.lang.IllegalStateException: STREAM at

fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
It's a small indexing job coming from nutch. 2011-10-26 15:07:29,039 WARN mapred.LocalJobRunner - job_local_0011 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executi$ at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRec$ at

Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
/heaplog ... Heap dump file created [306866344 bytes in 32.376 secs] On Wed, Oct 26, 2011 at 11:09 AM, Fred Zimmerman zimzaz@gmail.com wrote: It's a small indexing job coming from nutch. 2011-10-26 15:07:29,039 WARN mapred.LocalJobRunner - job_local_0011 java.io.IOException

Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
by itself. On Wed, Oct 26, 2011 at 1:01 PM, Fred Zimmerman zimzaz@gmail.com wrote: More on what's happening. It seems to be timing out during the commit. The new documents are small, but the existing index is large (11 million records). INFO: Closing Searcher@4a7df6 main fieldValueCache

limiting searches to particular sources

2011-11-02 Thread Fred Zimmerman
I want to be able to limit some searches to particular sources, e.g. wiki only, crawled only, etc. So I think I need to create a source field in the schema.xml. However, the native data for these sources does not contain source info (e.g. crawled). So I want to use (I think) copyField to add a

Re: limiting searches to particular sources

2011-11-04 Thread Fred Zimmerman
If you're crawling the data by yourself, you can just add the source to the document. If you're using DIH, you can specify the field as a constant. Or you could implement a custom Transformer that inserted it for you. Best Erick On Wed, Nov 2, 2011 at 10:52 AM, Fred Zimmerman zimzaz
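
The first suggestion above (add the source to each document yourself) and the matching restriction at query time might look roughly like this in Python. The field name source follows the question; tag_source and source_filtered_url are hypothetical helpers, and the schema is assumed to declare source as an indexed field.

```python
from urllib.parse import urlencode


def tag_source(docs, source):
    """Return copies of the documents with a constant source field added."""
    return [dict(doc, source=source) for doc in docs]


def source_filtered_url(base, query, source):
    """Build a /select URL restricted to one source with a filter query (fq)."""
    params = [("q", query), ("fq", "source:%s" % source), ("wt", "csv")]
    return base.rstrip("/") + "/select?" + urlencode(params)
```

Documents would be passed through tag_source("wiki") or tag_source("crawled") before indexing, and a "wiki only" search then adds fq=source:wiki to the request.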

Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Fred Zimmerman
Any options that do not require adding new software? On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: Shaun: You should try NRT available with Solr with RankingAlgorithm here. You should be able to add docs in real time and also query them in real

remove answers with identical scores

2011-11-24 Thread Fred Zimmerman
I have a corpus that has a lot of identical or nearly identical documents. I'd like to return only the unique ones (excluding the nearly identical which are redirects). I notice that all the identical/nearly identicals have identical Solr scores. How can I tell Solr to throw out all the

Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
LSH clustering. On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman zimzaz@gmail.com wrote: I have a corpus that has a lot of identical or nearly identical documents. I'd like to return only the unique ones (excluding the nearly identical which are redirects). I notice that all
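
For comparison with the clustering suggestion, the score-based idea from the original question can be tried on the client after the results come back. This Python sketch is only an illustration of that idea, not a recommended deduplication method: two unrelated documents can share a score, so it may drop legitimate results. drop_repeated_scores is a hypothetical name, and each result is assumed to be a dict with a numeric score key.

```python
def drop_repeated_scores(docs, ndigits=6):
    """Keep the first document for each distinct score, preserving order.

    Scores are rounded to ndigits so tiny floating-point differences
    do not hide a repeat.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = round(doc["score"], ndigits)
        if key in seen:
            continue  # same score as an earlier result: treat as a duplicate
        seen.add(key)
        unique.append(doc)
    return unique
```

Because results normally arrive sorted by score, the document kept for each score is simply whichever one Solr ranked first.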