Hi,
I would like to take the HTML documents that come back from a Solr search
and merge them into a single HTML document containing the body text of
each individual result. What is a good strategy for this? I am crawling
with Nutch and using Carrot2 for clustering.
Fred
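One simple strategy is to do the merge client-side: extract the body text of each result document and wrap the pieces in a new page. A minimal sketch in stdlib-only Python (class and function names are my own, not part of Solr or Nutch):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects text that appears inside the <body> element."""
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.chunks.append(data.strip())

def merge_documents(html_docs, title="Combined results"):
    """Merge the body text of several HTML documents into one page."""
    sections = []
    for doc in html_docs:
        parser = BodyTextExtractor()
        parser.feed(doc)
        sections.append("<div>" + " ".join(parser.chunks) + "</div>")
    return ("<html><head><title>%s</title></head><body>%s</body></html>"
            % (title, "\n".join(sections)))
```

A real pipeline would fetch the result URLs first (or read the cached Nutch segments) and feed each page's HTML to `merge_documents`.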
can you say a bit more about this? I see Velocity and will download it and
start playing around but I am not quite sure I understand all the steps that
you are suggesting. Fred
On Thu, Sep 22, 2011 at 19:51, Markus Jelsma markus.jel...@openindex.io wrote:
Hi,
Solr supports the Velocity
This seems to be out of date. I am running Solr 3.4
* the file structure of apachehome/contrib is different and I don't see
velocity anywhere underneath
* the page referenced below only talks about Solr 1.4 and 4.0
?
On Thu, Sep 22, 2011, erik.hatc...@gmail.com wrote:
conf/velocity by default. See Solr's example configuration.
Erik
On Sep 23, 2011, at 12:37, Fred Zimmerman w...@nimblebooks.com wrote:
ok, answered my own question, found velocity rw in solrconfig.xml. next
question:
where does velocity look for its
This http request works as desired (bringing back a csv file)
http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv
but the same URL submitted via wget produces the 500 error reproduced below.
I want the wget to download the csv file. What's going on?
got it.
curl "http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select/?indent=on&q=video&fl=name,id&wt=csv"
works like a champ.
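A likely explanation for the wget failure is that an unquoted URL lets the shell interpret special characters such as `&` before wget ever sees them, so Solr receives a mangled request. Quoting the URL fixes it, as does building the query string programmatically. A small Python sketch of the latter (the host is taken from the thread; the parameter values are just the ones used above):

```python
from urllib.parse import urlencode

# Build the query string programmatically so every "&" and "=" lands
# where Solr expects it; no shell quoting issues.
params = {"indent": "on", "q": "video", "fl": "name,id", "wt": "csv"}
url = ("http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select/?"
       + urlencode(params))
# urllib.request.urlopen(url).read() would then fetch the CSV bytes.
```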
Hi,
for my application, I would like to be able to create web queries
(wget/curl) that return "more like this" results for either a single
arbitrarily specified URL or for the first x terms in a search query. I want
to return the results to myself as a CSV file using wt=csv. How can I
accomplish the MLT
Hi,
I want to include the search query in the output of wt=csv (or a duplicate
of it) so that the process that receives this output can do something with
the search terms. How would I accomplish this?
Fred
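Solr's CSV response writer only emits stored fields, so one straightforward approach is client-side: after fetching the CSV, append the originating query as an extra column before passing the file downstream. A sketch, assuming the receiving process can tolerate an added column (the column name `search_query` is my invention):

```python
import csv
import io

def add_query_column(csv_text, query):
    """Append the originating query as a trailing column on every row."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = list(reader)
    rows[0].append("search_query")          # header row
    for row in rows[1:]:                    # data rows
        row.append(query)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```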
Hi,
I am getting ready to index a recent copy of Wikipedia's pages-articles
dump. I have two servers, foo and bar. On foo.com/mediawiki I have a
Mediawiki install serving up the pages. On bar.com/solr I have my solr
install. I have the pages-articles.xml file from Wikipedia and the solr
dumb question ...
today I set up solr3.4/example; indexing to 8983 via post is working, and so
is search, but solr/dataimport reports
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-10-19 18:13:57</str>
<str
Solr dataimport is reporting file not found when it looks for foo.xml.
Where is it looking for /data? Is this a URL off apache2/htdocs on the
server, or is it a path within example/solr/...?
<entity name="page"
        processor="XPathEntityProcessor"
        stream="true"
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/
to
solve. Offhand, it looks as though you're trying to do something
with DIH that it wasn't intended to do. But that's just a guess
since the details of what you're trying to do are so sparse...
Best
Erick
On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman zimzaz@gmail.com
wrote:
Solr
Hi,
it seems from my limited experience thus far that as new data types are
added, schema.xml will tend to become bloated with many different field and
fieldtype definitions. Is this a problem in real life, and if so, what
strategies are used to address it?
FredZ
So, basically, yes, it is a real problem and there is no designed solution?
e.g. optional sub-schema files that can be turned off and on?
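One partial answer, if your Solr version's config loader has XInclude enabled (it is supported for Solr's XML config files): keep groups of related type and field definitions in separate files and pull them in with `xi:include`, so an unused group can be dropped by deleting one line. A hedged sketch (the included file name is invented):

```xml
<schema name="example" version="1.4">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <!-- optional type groups live in their own files,
         included only when that data source is in use -->
    <xi:include href="fieldtypes-wiki.xml"
                xmlns:xi="http://www.w3.org/2001/XInclude"/>
  </types>
</schema>
```

Dynamic fields (`<dynamicField name="*_s" .../>`) are the other common way to keep schema.xml from growing with every new data type.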
On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote:
it seems from my limited
what about something that's a bit less discovery-oriented? for my particular
application I am most concerned with bringing back a straightforward top
ten answer set and having users look at it. I actually don't want to bother
them with faceting, etc. at this juncture.
Fred
On Tue, Oct 25, 2011
It is not a multi-core setup. The solr.xml has a null value for cores. ?
HTTP ERROR 404
Problem accessing /solr/admin/index.jsp. Reason:
missing core name in path
2011-10-26 13:40:21.182:WARN::/solr/admin/
java.lang.IllegalStateException: STREAM
at
It's a small indexing job coming from nutch.
2011-10-26 15:07:29,039 WARN mapred.LocalJobRunner - job_local_0011
java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error
executi$
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRec$
at
/heaplog ...
Heap dump file created [306866344 bytes in 32.376 secs]
On Wed, Oct 26, 2011 at 11:09 AM, Fred Zimmerman zimzaz@gmail.com wrote:
by itself.
On Wed, Oct 26, 2011 at 1:01 PM, Fred Zimmerman zimzaz@gmail.com wrote:
More on what's happening. It seems to be timing out during the commit.
The new documents are small, but the existing index is large (11 million
records).
INFO: Closing Searcher@4a7df6 main
fieldValueCache
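When a client-side commit times out against a large index, one common mitigation is to stop committing from the client and let the server commit on its own schedule via solrconfig.xml's autoCommit (the thresholds below are illustrative, not recommendations from the thread):

```xml
<autoCommit>
  <!-- commit after 10k docs or 60s, whichever comes first -->
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
</autoCommit>
```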
I want to be able to limit some searches to particular sources, e.g. wiki
only, crawled only, etc. So I think I need to create a "source" field in
the schema.xml. However, the native data for these sources does not
contain source info (e.g. "crawled"). So I want to use (I think)
copyfield to add a
If you're crawling the data by yourself, you can just add the source
to the document.
If you're using DIH, you can specify the field as a constant. Or you
could implement a custom Transformer that inserted it for you.
Best
Erick
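The "constant via DIH" suggestion can be done with TemplateTransformer: a `<field>` whose `template` is a literal string is stamped onto every document the entity produces. A sketch, building on the entity from earlier in the thread (the path and the value `wiki` are placeholders):

```xml
<entity name="page"
        processor="XPathEntityProcessor"
        transformer="TemplateTransformer"
        stream="true"
        url="/home/solr/data/pages-articles.xml"
        forEach="/mediawiki/page">
  <!-- every document from this entity gets source=wiki -->
  <field column="source" template="wiki"/>
</entity>
```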
On Wed, Nov 2, 2011 at 10:52 AM, Fred Zimmerman zimzaz
Any options that do not require adding new software?
On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya
nnagaraja...@transaxtions.com wrote:
Shaun:
You should try NRT available with Solr with RankingAlgorithm here. You
should be able to add docs in real time and also query them in real
I have a corpus that has a lot of identical or nearly identical documents.
I'd like to return only the unique ones (excluding the nearly identical
which are redirects). I notice that all the identical/nearly identicals
have identical Solr scores. How can I tell Solr to throw out all the
LSH clustering.
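An index-time alternative that needs no new software is Solr's built-in deduplication: SignatureUpdateProcessorFactory computes a fuzzy signature per document and, with overwriteDupes, keeps only one copy of each near-duplicate. A hedged solrconfig.xml sketch (the field list `title,body` is an assumption about your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,body</str>
    <!-- TextProfileSignature tolerates small differences;
         Lookup3Signature matches exact duplicates only -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain then has to be referenced from your update handler, and `signature` needs to exist as a stored field in schema.xml.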