Re: Regex replacement not working!

2011-06-29 Thread Adam Estrada
I have had the same problems with regex, and I went with the regular pattern
replace filter rather than the char filter. Only when I added it to the very end
of the chain would it work... I am on Solr 3.2. I have also noticed that the
HTML strip filter factory is not working either: when I dump the field it is
supposed to operate on, all the hyperlinks and everything else you would expect
to be stripped are still present.

Adam

On Wed, Jun 29, 2011 at 10:04 AM, samuele.mattiuzzo samum...@gmail.com wrote:

 ok, last question on the UpdateProcessor: can you please give me the steps
 to implement my own?
 I mean, I can push my custom processor into Solr's code, and then what?
 I don't understand how I have to change solrconfig.xml, or how I can bind
 that to the updater I just wrote, and I also don't understand how I have to
 change schema.xml.

 I'm sorry for this question, but I started working on Solr 5 days ago, and
 for some things I really need a lot of documentation, and this isn't fully
 covered anywhere.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Regex-replacement-not-working-tp3120748p3121743.html
 Sent from the Solr - User mailing list archive at Nabble.com.
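
For reference, wiring a custom UpdateRequestProcessor into solrconfig.xml
generally looks like the sketch below (hedged; the factory class and chain
name are hypothetical). The jar with the factory goes in a lib directory
referenced by solrconfig.xml, and the chain is selected per update request
with the update.processor parameter (renamed update.chain in later releases);
schema.xml only changes if the processor emits new fields.

<updateRequestProcessorChain name="mychain">
  <processor class="com.example.MyUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>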



REGEX Proper Usage?

2011-06-17 Thread Adam Estrada
All,

I am having trouble getting my regex pattern to work properly. I have tried
PatternReplaceFilterFactory after the standard tokenizer

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])"
        replacement=" " replace="all"/>

and PatternReplaceCharFilterFactory before it.

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="([^a-zA-Z0-9])" replacement=" " replace="all"/>

It looks like this should work to remove everything except letters and
numbers.

<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>

<filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords_en.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.LengthFilterFactory" min="2" max="999"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z0-9])" replacement=" " replace="all"/>

I am left with quite a few facet items like this:

<int name="_ view">1443</int>
<int name="view _">1599</int>

Can anyone suggest what may be going on here? I have verified that my regex
works properly here: http://www.fileformat.info/tool/regex.htm

Adam
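
For comparison, a version of the filter that strips non-alphanumerics outright
rather than replacing them with a space, placed at the very end of the chain
as the follow-up thread suggests. A hedged sketch, not a confirmed fix:

<filter class="solr.PatternReplaceFilterFactory"
        pattern="[^a-z0-9]" replacement="" replace="all"/>

Replacing with a space leaves whitespace embedded in the token, since tokens
are not re-split after a token filter runs; stripping avoids that.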


Re: Mahout Solr

2011-06-16 Thread Adam Estrada
You're right...It would be nice to be able to see the cluster results coming
from Solr though...

Adam

On Thu, Jun 16, 2011 at 3:21 AM, Andrew Clegg andrew.clegg+mah...@gmail.com
 wrote:

 Well, it does have the ability to pull TermVectors from an index:


 https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene

 Nothing Solr-specific about it though.

 On 15 June 2011 15:38, Mark static.void@gmail.com wrote:
  Apache Mahout is a new Apache TLP project to create scalable machine
  learning algorithms under the Apache license. It is related to other Apache
  Lucene projects and integrates well with Solr.
 
  How does Mahout integrate well with Solr? Can someone give a brief overview
  of what's available? I'm guessing one of the features would be replacing the
  Carrot2 clustering algorithm with something a little more sophisticated?
 
  Thanks
 



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



Re: Mahout Solr

2011-06-15 Thread Adam Estrada
The only integration at this point (as far as I can tell) is that Mahout can
read the Lucene index created by Solr. I agree that it would be nice to swap
out the Carrot2 clustering engine for Mahout's set of algorithms, but that
has not been done yet. Grant has pointed out that you can use Solr's
callback system to fire off another task, like a Mahout job.

http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/
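
The callback Grant refers to is, I believe, the event listener hook in
solrconfig.xml. A hedged sketch that shells out to a script, which could in
turn kick off a Mahout job (the script path is hypothetical):

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/opt/scripts/run-mahout-job.sh</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
</listener>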

Adam

On Wed, Jun 15, 2011 at 10:38 AM, Mark static.void@gmail.com wrote:

 Apache Mahout is a new Apache TLP project to create scalable machine
 learning algorithms under the Apache license. It is related to other Apache
 Lucene projects and integrates well with Solr.

 How does Mahout integrate well with Solr? Can someone give a brief overview
 of what's available? I'm guessing one of the features would be replacing the
 Carrot2 clustering algorithm with something a little more sophisticated?

 Thanks



[Handling] empty fields

2011-06-15 Thread Adam Estrada
All,

I have a field, foo, that is blank or missing in several thousand documents.
It is also my faceting field. My question is: how can I deal with this field
so that I don't get a blank facet at query time?

<int name="">5000</int>
vs.
<int name="Flickr">1000</int>

Adam


Re: Finding Keywords/Phrases

2011-06-12 Thread Adam Estrada
Hi Frank,

I have been working on something very similar and I am at the point where I
don't believe (and I could be totally wrong) that a pure Solr solution is
going to do this. I would look at Mahout and play with some of the machine
learning algorithms that it can run against a Lucene index. I have not
gotten any further than experimenting with it right now but so far it looks
promising.

Adam

On Sun, Jun 12, 2011 at 10:20 AM, Frank A fsa...@gmail.com wrote:

 I have a single copyfield that has a number of other fields copied to it.
 I'm trying to extract a list of keywords and common terms. I realize it
 may not be 100% dynamic and I may need to filter manually. Right now I have
 tried using a CommonGrams filter. However, what I see is that it creates
 tokens for both the individual "hot"/"dog" terms and the "hot dog" gram.
 Is there any way, from within Solr configuration, to count "hot" only when
 it is not followed by "dog"? For example, right now I may see
 term/frequencies of:

 hot   8
 dog  6
 hot dog  6

 What I really want is:

 hot dog 6
 hot 2

 Any ideas?



[Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the
core build but the docs say that it's not very good for very large indexes.
Anyone have thoughts on this?

Thanks,
Adam


[Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
All,

I am at a bit of a loss here, so any help would be greatly appreciated. I am
using the DIH to grab data from a DB. The field that I am most interested in
has anywhere from one word to several paragraphs' worth of free text. What I
would really like to do is pull out phrases like "Joe's coffee shop" rather
than the three individual words. I have tried the KeywordTokenizerFactory,
and that seems to do what I want for the most part, but it is not actually
tokenizing anything, so it does not create the tokens that I need for further
analysis in apps like Mahout.

We can play with the combination of tokenizers and filters all day long and
see what the results are after a quick reindex. I typically just view them
in Solritas as facets, which may be part of the problem for me too. Does
anyone have an example fieldType they can share with me that shows how to
extract phrases, if they are there, from the data I described earlier? Am I
even going about this the right way? I am using today's trunk build of Solr,
and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
            outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks,
Adam
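
For the phrase-extraction goal above, one hedged alternative is to tokenize
normally and let ShingleFilterFactory assemble the multi-word tokens; with
KeywordTokenizerFactory the whole value is a single token, so the shingle
filter has nothing to combine. A minimal sketch, not a drop-in replacement:

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

Faceting on such a field yields both single terms and two- and three-word
phrases, which can then be filtered down to the interesting ones.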


Re: [Mahout] Integration with Solr

2011-06-09 Thread Adam Estrada
Thanks for the reply, Tommaso! I would like to see tighter integration, like
the way Nutch integrates with Solr: there is a single param that you set
which points to the Solr instance. My interest in Mahout is in its
ability to handle large data and find frequency, collocation of data,
clustering, etc. All the algorithms that are in the core build are great,
and I am just now wrapping my head around how to use them all.

Adam

On Thu, Jun 9, 2011 at 10:33 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:

 Hello Adam,
 I've managed to create a small POC of integrating Mahout with Solr for a
 clustering task; do you want to use it for clustering only, or possibly for
 other purposes/algorithms?
 More generally speaking, I think it'd be nice if Solr could be extended
 with a proper API for integrating clustering engines, so that one can plug
 in and exchange engines flawlessly (just need an Adapter).
 Regards,
 Tommaso

 2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com

  Has anyone integrated Mahout with Solr? I know that Carrot2 is part of
 the
  core build but the docs say that it's not very good for very large
 indexes.
  Anyone have thoughts on this?
 
  Thanks,
  Adam
 



Re: [Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
Erick,

I totally understand that, BUT the keyword tokenizer factory does a really
good job extracting phrases (or what look like phrases) from my data. I
don't know exactly why, but it does. I am going to continue working
through it to see if I can't figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson erickerick...@gmail.com wrote:

 The problem here is that none of the built-in filters or tokenizers
 have a prayer of recognizing what *you* think are phrases, since it'll be
 unique to your situation.

 If you have a list of phrases you care about, you could substitute a
 single token for the phrases you care about...

 But the overriding question is: what determines a phrase you're
 interested in? Is it a list, or is there some heuristic you want to apply?

 Or could you just recognize them at query time and make them into a
 literal phrase (i.e., with quotation marks)?

 Best
 Erick

 On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:
  All,
 
  I am at a bit of a loss here so any help would be greatly appreciated. I
 am
  using the DIH to grab data from a DB. The field that I am most interested
 in
  has anywhere from 1 word to several paragraphs worth of free text. What I
  would really like to do is pull out phrases like Joe's coffee shop
 rather
  than the 3 individual words. I have tried the KeywordTokenizerFactory and
  that does seem to do what I want it to do but it is not actually
 tokenizing
  anything so it does what I want it to for the most part but it's not
  creating the tokens that I need for further analysis in apps like Mahout.
 
  We can play with the combination of tokenizers and filters all day long
 and
  see what the results are after a quick reindex. I typically just view
 them
  in Solritas as facets which may be the problem for me too. Does anyone
 have
  an example fieldType they can share with me that shows how to
  extract phrases if they are there from the data I described earlier. Am I
  even going about this the right way? I am using today's trunk build of
 Solr
  and here is what I have munged together this morning.
 
  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="true">
   <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
            outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
   </analyzer>
  </fieldType>
 
  Thanks,
  Adam
 



[Visualizations] from Query Results

2011-06-03 Thread Adam Estrada
Dear Solr experts,

I am curious to learn what visualization tools are out there to help me
visualize my query results. I am not talking about a language-specific
client per se, but something more like Carrot2, which breaks clusters into
their knowledge tree and expandable pie chart. Sorry if those aren't the
correct names for those tools ;-) Anyway, what else is out there like
Carrot2 (http://project.carrot2.org/) to help me visualize Solr query results?

Thanks for your input,
Adam


Re: [Visualizations] from Query Results

2011-06-03 Thread Adam Estrada
Otis and Erick,

Believe it or not, I did Google this and didn't come up with anything all
that useful. I was at the Lucene Revolution conference last year and saw
some prezos that had some sort of graphical representation of the query
results. The one from Basis Tech especially caught my attention because it
simply showed a graph of hits over time. I can do that using jQuery or
Raphael, as he suggested. I have also been playing with the Carrot2
visualization tools, which are pretty cool too, which is why I pointed them
out in my original email. I was just curious to see if there were any
specialty projects out there like Carrot2 that folks in the Solr
community are using.

Adam

On Fri, Jun 3, 2011 at 9:42 AM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Adam,

 Try this:
 http://lmgtfy.com/?q=search%20results%20visualizations

 In practice I find that visualizations are cool and attractive looking, but
 often text is more useful because it's more direct.  But there is room for
 graphical representation of search results, sure.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Adam Estrada estrada.adam.gro...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, June 3, 2011 7:13:39 AM
  Subject: [Visualizations] from Query Results
 
  Dear Solr experts,
 
  I am curious to learn what visualization tools are out there to help me
  visualize my query results. I am not talking about a language specific
  client per se but something more like Carrot2 which breaks clusters in to
  their knowledge tree and expandable pie chart. Sorry if those aren't the
  correct names for those tools ;-) Anyway, what else is out there like
  Carrot2 http://project.carrot2.org/ to help me visualize Solr query
  results?
 
  Thanks for your input,
  Adam
 



GeoJSON Response Writer

2011-05-29 Thread Adam Estrada
All,

Has anyone modified the current JSON response writer to include the GeoJSON
geospatial encoding standard? See here: http://geojson.org/

Just curious...
Adam


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Adam Estrada
Well... by default there is a pretty decent schema that you can use as a
template in the example project that ships with Solr. Tika is the library
that does the actual content extraction, so it would be a good idea to try
the example project out first.
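
On storing binary data specifically: newer Solr releases include a BinaryField
type for storing (not searching) raw bytes, which are sent and returned
base64-encoded. A hedged schema.xml sketch, with a hypothetical field name:

<fieldType name="binary" class="solr.BinaryField"/>
<field name="attachment" type="binary" indexed="false" stored="true"/>

For 16MB images, though, the usual advice is to store them outside Solr and
index only a path or URL alongside the extracted metadata.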

Adam

2011/4/6 Ezequiel Calderara ezech...@gmail.com

 Another question that may be easier to answer: how can I store binary
 data? Any example schema?

 2011/4/6 Ezequiel Calderara ezech...@gmail.com

  Hello everyone, I need to know if someone has used Solr for indexing and
  storing images (up to 16MB) or binary docs.
 
  How does Solr behave with this type of doc? How does it affect performance?
 
  Thanks Everyone
 
  --
  __
  Ezequiel.
 
  Http://www.ironicnet.com
 



 --
 __
 Ezequiel.

 Http://www.ironicnet.com



Re: dataimport

2011-03-09 Thread Adam Estrada
Brian,

I had the same problem a while back and set the JAVA_OPTS env variable
to something my machine could handle. That may also be an option for
you going forward.

Adam

On Wed, Mar 9, 2011 at 9:33 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 This has since been fixed. The problem was that there was not enough memory
 on the machine. It works just fine now.

 On Tue, Mar 8, 2011 at 6:22 PM, Chris Hostetter 
 hossman_luc...@fucit.org wrote:


 : INFO: Creating a connection for entity id with URL:
 :
 jdbc:mysql://localhost/researchsquare_beta_library?characterEncoding=UTF8&zeroDateTimeBehavior=convertToNull
 : Feb 24, 2011 8:58:25 PM
 org.apache.solr.handler.dataimport.JdbcDataSource$1
 : call
 : INFO: Time taken for getConnection(): 137
 : Killed
 :
 : So it looks like for whatever reason, the server crashes trying to do a
 full
 : import. When I add a LIMIT clause on the query, it works fine when the
 LIMIT
 : is only 250 records but if I try to do 500 records, I get the same
 message.

 ...wow.  that's ... weird.

 I've never seen a java process just log "Killed" like that.

 The only time I've ever seen a process log "Killed" is if it was
 terminated by the OS (ie: kill -9 <pid>)

 What OS are you using? How are you running Solr? (ie: are you using the
 simple jetty example "java -jar start.jar" or are you using a different
 servlet container?) ... Are you absolutely certain your machine doesn't
 have some sort of monitoring in place that kills jobs if they take too
 long, or use too much CPU?


 -Hoss




Re: Tomcat EXE Source Code

2011-02-25 Thread Adam Estrada
Some of these links may help...

http://www.google.com/search?client=safari&rls=en&q=apache+tomcat+download&ie=UTF-8&oe=UTF-8

Adam


On Feb 25, 2011, at 3:16 AM, rajini maski wrote:

  Can anybody help me to get the source code of the Tomcat exe
 file, i.e., the source code of the installation exe?
 
 Thanks..



Re: Datetime problems with dataimport

2011-02-22 Thread Adam Estrada
I logged an issue in Jira that relates to this and it looks like Yonik picked 
it up.

https://issues.apache.org/jira/browse/SOLR-2286

Adam


On Feb 22, 2011, at 9:07 AM, MOuli wrote:

 
 Ok, I got it.
 
 It should look like yyyy-MM-ddTHH:mm:ssZ,
 for example: 2011-02-22T15:07:00Z
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Datetime-problems-with-dataimport-tp2545654p2552477.html
 Sent from the Solr - User mailing list archive at Nabble.com.



[Solr] and CouchDB

2011-02-19 Thread Adam Estrada
I am curious to see if anyone has messed around with Solr and the
Couch-Lucene incarnation that is out there... I was passed this article
this morning, and it really opened my eyes about CouchDB:
http://m.readwriteweb.com/hack/2011/02/hacker-chat-max-ogden.php

Thoughts,
Adam


Re: Indexing AutoCAD files

2011-02-19 Thread Adam Estrada
Hi Vignesh,

I believe that you would have to incorporate GDAL into Tika in order
to read the file and extract the proper metadata. This is entirely
doable, but I don't know how to do it. There are companies out there
that specialize in this sort of thing, so hopefully one of them has
already contacted you outside of this list, but I would love to see
some detailed instruction on how to integrate GDAL into Tika.

Best of luck,
Adam

On Sat, Feb 19, 2011 at 12:31 AM, Vignesh Raj
vignesh...@greatminds.co.in wrote:
 Hi team,

 Is there a way lucene can index AutoCAD files - *.dwg files?

 If so, please let me know.

 Can you please provide some insight on the same?



 Thanks in advance..



 Regards

 Vignesh




Re: Index Autocad

2011-02-19 Thread Adam Estrada
I think you may have already posted this same question, but please
check out VoyagerGIS. They have some shit-hot software that is geared
specifically towards the archival and retrieval of geospatial data. I
suggest that you check it out!!!

w/r,
Adam


On Sat, Feb 19, 2011 at 2:33 AM, lucene lucene luc...@greatminds.co.in wrote:
 Hi team,

 Is there a way lucene can index AutoCAD files – “*.dwg” files?

 If so, please let me know.

 Can you please provide some insight on the same?



 Thanks in advance..



 Regards

 Vignesh



Re: Difference between Solr and Lucidworks distribution

2011-02-13 Thread Adam Estrada
I believe that the Lucid Works distro for Solr is free, and as you mentioned,
they only appear to sell their services for it. I have used that version for
several demos because it does seem to have all the bells and whistles already
included, and it's super easy to set up. The only downside in my case is that
they are still on the official release version 1.4.1, which has an older version
of PDFBox that doesn't parse PDFs generated from newer Adobe software. Thanks,
Adobe ;-) It's easy enough to just rebuild Tika, PDFBox, FontBox, etc. and swap
them out... If you want spatial support, you can use the plugin from the Spatial
Solr project out of the Netherlands, which is designed to support 1.4.1 and,
from what I can tell, seems to work pretty well.

Anyway, when 4.0 is released, hopefully with the extended spatial support from
projects like SIS and JTS, I hope to see the official distro version change
from Lucid.

Thanks for all hard work the Lucid Team has provided over the years!

Adam

On Feb 12, 2011, at 10:55 PM, Andy wrote:

 Now I'm confused.
 
 In http://www.lucidimagination.com/lwe/subscriptions-and-pricing, the price
 of LucidWorks Enterprise Software is stated as FREE. I thought the price
 for Production was for the support service, not for the software.
 
 But you seem to be saying that 'LucidWorks Enterprise' is separate software
 that isn't free. Did I misunderstand?
 
 --- On Sat, 2/12/11, Lance Norskog goks...@gmail.com wrote:
 
 From: Lance Norskog goks...@gmail.com
 Subject: Re: Difference between Solr and Lucidworks distribution
 To: solr-user@lucene.apache.org, markus.jel...@openindex.io
 Date: Saturday, February 12, 2011, 8:10 PM
 There are two distributions.
 
 The company is Lucid Imagination. 'Lucidworks for Solr' is
 the
 certified distribution of Solr 1.4.1, with several
 enhancements.
 
 Markus refers to 'LucidWorks Enterprise', which is LWE.
 This is a
 separate app with tools and a REST API for managing a Solr
 instance.
 
 Lance Norskog
 
 On Fri, Feb 11, 2011 at 8:36 AM, Markus Jelsma
 markus.jel...@openindex.io
 wrote:
 It is not free for production environments.
 http://www.lucidimagination.com/lwe/subscriptions-and-pricing
 
 On Friday 11 February 2011 17:31:22 Greg Georges
 wrote:
 Hello all,
 
 I just started watching the webinars from
 Lucidworks, and they mention
 their distribution which has an installer, etc..
 Is there any other
 differences? Is it a good idea to use this free
 distribution?
 
 Greg
 
 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350
 
 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
 
 
 
 



Re: [WKT] Spatial Searching

2011-02-09 Thread Adam Estrada
Grant,

How could I stub this out, not being a Java guy? What is needed in order to do
this?

Licensing is always going to be an issue with JTS, which is why I am interested
in the SIS project sitting in incubation right now.

I'm willing to put forth the effort if I had a little direction on how to
implement it from the peanut gallery ;-)

Adam

On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote:

 The show stopper for JTS is its license, unfortunately.  Otherwise, I think
 it would be done already!  We could, since it's LGPL, make it an optional
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is
 for that project. This got me looking more into spatial mods with Solr 4.0.
 I found this enhancement in Jira:
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David
 mentions that he's already integrated JTS into Solr 4.0 for querying on
 polygons stored as WKT.

 It's relatively easy to get WKT strings into Solr, but does the field type
 exist yet? Is there a patch or something that I can test out?

 Here's how I would do it using GDAL/OGR and the already existing CSV update
 handler (http://www.gdal.org/ogr/drv_csv.html):

 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

 This converts a shapefile to a CSV with the geometries intact in the form
 of WKT. You can then get the data into Solr by running the following
 command:

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

 There are lots of flavors of geometries, so I suspect that this will be a
 daunting task, but because JTS recognizes each geometry type it should be
 possible to work with them.

 Does anyone know of a patch, or even when this functionality might be
 included in Solr 4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 



Re: [WKT] Spatial Searching

2011-02-09 Thread Adam Estrada
Thought I would share this on web mapping... it's a great write-up and something
to consider when talking about working with spatial data.

http://www.tokumine.com/2010/09/20/gis-data-payload-sizes/

Adam


On Feb 9, 2011, at 7:03 AM, Grant Ingersoll wrote:

 The show stopper for JTS is its license, unfortunately.  Otherwise, I think
 it would be done already!  We could, since it's LGPL, make it an optional
 dependency, assuming someone can stub it out.
 
 On Feb 8, 2011, at 11:18 PM, Adam Estrada wrote:
 
 I just came across a ~nudge post over in the SIS list on what the status is
 for that project. This got me looking more into spatial mods with Solr 4.0.
 I found this enhancement in Jira:
 https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David
 mentions that he's already integrated JTS into Solr 4.0 for querying on
 polygons stored as WKT.

 It's relatively easy to get WKT strings into Solr, but does the field type
 exist yet? Is there a patch or something that I can test out?

 Here's how I would do it using GDAL/OGR and the already existing CSV update
 handler (http://www.gdal.org/ogr/drv_csv.html):

 ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

 This converts a shapefile to a CSV with the geometries intact in the form
 of WKT. You can then get the data into Solr by running the following
 command:

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

 There are lots of flavors of geometries, so I suspect that this will be a
 daunting task, but because JTS recognizes each geometry type it should be
 possible to work with them.

 Does anyone know of a patch, or even when this functionality might be
 included in Solr 4.0? I need to query for polygons ;-)
 Thanks,
 Adam
 
 
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 



Re: Architecture decisions with Solr

2011-02-09 Thread Adam Estrada
I tried the multi-core route, and it gets too complicated and cumbersome to
maintain; that is just from my own personal testing... It was suggested that
each user have their own ID in a single index that you can query against
accordingly. In the example schema.xml I believe there is a field type called
textTight, or something like that, which is meant for SKU numbers. Give each
user their own GUID or MD5 hash and add that as part of all your queries. That
way, only their data are returned. It would be the equivalent of something like
this...

SELECT * FROM mytable WHERE userid = '3F2504E04F8911D39A0C0305E82C3301' AND 
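
In Solr terms that is just a filter query appended to every request; a hedged
sketch, with a hypothetical field name:

http://localhost:8983/solr/select?q=*:*&fq=userid:3F2504E04F8911D39A0C0305E82C3301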

Grant Ingersoll gave a presentation at the Lucene Revolution conference that 
demonstrated that you can build a query to be as easy or as complicated as any 
SQL statement. Maybe he can share that PPT?

Adam

On Feb 9, 2011, at 2:47 PM, Sujit Pal wrote:

 Another option (assuming the case where a user can be granted access to
 a certain class of documents, and more than one user would be able to
 access certain documents) would be to store the access filter (as an OR
 query of content types) in an external cache (perhaps a database, or an
 external cache that the database changes are published to periodically),
 then use this access filter as a facet on the base query.
 
 -sujit
 
 On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote:
 "This application will be built to serve many users"

 If this means that you have thousands of users, 1000s of VMs and/or
 1000s of cores is not going to scale.

 Have an ID in the index for each user, and filter using it.
 Then they can see only their own documents.

 This assumes that you are building an app through which they
 authenticate and which talks to Solr
 (i.e. all requests are filtered using their ID).
 
 -Glen
 
 On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com 
 wrote:
 From what I understand about multicore, each of the indexes is independent
 of the others, right? Or would one index have access to the info of the
 other? My requirement is, as you mention, that a client has access only to
 his or her search data based on their documents. Other clients have no
 access to the index of other clients.
 
 Greg
 
 -Original Message-
 From: Darren Govoni [mailto:dar...@ontrenet.com]
 Sent: 9 février 2011 14:28
 To: solr-user@lucene.apache.org
 Subject: Re: Architecture decisions with Solr
 
 What about standing up a VM (a search appliance that you would make) for
 each client?
 If there's no data sharing across clients, then using the same Solr
 server/index doesn't seem necessary.

 Solr will easily meet your needs though; it's the best there is.
 
 On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:
 
 Hello all,
 
 I am looking into an enterprise search solution for our architecture, and I
 am very pleased to see all the features Solr provides. In our case, we will
 have a need for a highly scalable application for multiple clients. This
 application will be built to serve many users, who each will have a client
 account. Each client will have a multitude of documents to index (0-1000s of
 documents). After discussion, we were talking about going multicore and
 having one index file per client account. The reason for this is that
 security is achieved by having a separate index for each client, etc. Is
 this the best approach? How feasible is it (dynamically creating indexes on
 client account creation)? Or is it better to go the faceted search
 capabilities route? Thanks for your help.
 
 Greg
 
 
 
 
 
 
 



[WKT] Spatial Searching

2011-02-08 Thread Adam Estrada
I just came across a ~nudge post over in the SIS list on what the status is for
that project. This got me looking more into spatial mods with Solr 4.0. I
found this enhancement in Jira:
https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions
that he's already integrated JTS into Solr 4.0 for querying on polygons stored
as WKT.

It's relatively easy to get WKT strings into Solr, but does the field type
exist yet? Is there a patch or something that I can test out?

Here's how I would do it using GDAL/OGR and the already existing CSV update
handler (http://www.gdal.org/ogr/drv_csv.html):

ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

This converts a shapefile to a CSV with the geometries intact in the form of
WKT. You can then get the data into Solr by running the following command:

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

There are lots of flavors of geometries, so I suspect that this will be a
daunting task, but because JTS recognizes each geometry type it should be
possible to work with them.

Does anyone know of a patch, or even when this functionality might be included
in Solr 4.0? I need to query for polygons ;-)

Thanks,
Adam





Re: Time fields

2011-02-02 Thread Adam Estrada
If you're using the DIH you can configure it however you want. Here is a
snippet of my code. Note the DateFormatTransformer.

<dataConfig>
  <dataSource type="JdbcDataSource"
              name="bleh"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://localhost;databaseName=bleh;responseBuffering=adaptive;"
              user="test"
              password="test"
              onError="skip"/>
  <document>
    <entity name="Entities"
            dataSource="JIEE"
            transformer="DateFormatTransformer"
            query="SELECT
                     EntityUID AS id,
                     EntityType AS cat,
                     EntityUIDParent AS pid,
                     subject AS subject,
                     summary AS summary,
                     DateCreated AS eventdate,
                     Latitude AS lat,
                     Longitude AS lng,
                     Type AS jtype,
                     SupportCategory AS supcat,
                     Cause AS cause,
                     Status AS status,
                     Urgency AS urgency,
                     Priority AS priority,
                     Coordinate AS coords
                   FROM dbo.JIEESearchIndex">

      <field column="id" name="id" />
      <field column="cat" name="cat" />
      <field column="subject" name="subject" />
      <field column="summary" name="summary" />
      <field column="eventdate" name="eventdate"
             dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" />
      <field column="lat" name="lat" />
      <field column="lng" name="lng" />
      <field column="coords" name="coords" />
      <field column="jtype" name="jtype" />
      <field column="supcat" name="supcat" />
      <field column="cause" name="cause" />
      <field column="status" name="status" />
      <field column="urgency" name="urgency" />
    </entity>
  </document>
</dataConfig>
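
On the original question of time-of-day fields (as opposed to full dates): a
trie integer holding seconds since midnight range-queries cleanly. A hedged
sketch, assuming the stock tint type from the example schema and a
hypothetical field name:

<field name="time_of_day" type="tint" indexed="true" stored="true"/>

A query for 08:00-17:00 is then just time_of_day:[28800 TO 61200].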

On Wed, Feb 2, 2011 at 7:28 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 For time of day fields, NOT unix timestamp/dates, what is the best way to do
 that?

 I can think of seconds since beginning of day as integer
 OR
 string

 Any other ideas? Assume that I'll be using range queries. TIA.



  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




[Failure] to start Solr 4.0

2011-01-28 Thread Adam Estrada
All,

I've checked out the latest code and built the root directory with "ant compile",
and then I built the solr directory again using the "ant dist" command, which
gives me the lucene-libs directory and a couple of others. Now Solr won't start.
What am I missing??? This is as far as it gets.

mini:example Adam$ java -jar start.jar 
2011-01-28 17:14:23.402:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-01-28 17:14:23.605:INFO::jetty-6.1.26
2011-01-28 17:14:23.638:INFO::Started SocketConnector@0.0.0.0:8983

What could possibly be the problem?

Adam

Re: [Failure] to start Solr 4.0

2011-01-28 Thread Adam Estrada
I found the problem... You HAVE to build the Solr directory using "ant example"
in order for the web application to start properly. Sorry to post so many times.

Adam

On Jan 28, 2011, at 5:20 PM, Adam Estrada wrote:

 All,
 
 I've checked out the latest code and built the root directory with "ant
 compile", and then I built the solr directory again using the "ant dist"
 command, which gives me the lucene-libs directory and a couple of others.
 Now Solr won't start.  What am I missing???  This is as far as it gets.
 
 mini:example Adam$ java -jar start.jar 
 2011-01-28 17:14:23.402:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
 2011-01-28 17:14:23.605:INFO::jetty-6.1.26
 2011-01-28 17:14:23.638:INFO::Started SocketConnector@0.0.0.0:8983
 
 What could possibly be the problem?
 
 Adam



Re: Tika config in ExtractingRequestHandler

2011-01-27 Thread Adam Estrada
I believe that as long as Tika is included in a folder that is
referenced by solrconfig.xml, you should be good. Solr will
automatically hand mime types off to Tika for parsing. Can anyone else
add to this?

Thanks,
Adam

On Thu, Jan 27, 2011 at 5:06 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

 The wiki page for the ExtractingRequestHandler says that I can add the
 following configuration:
 <str name="tika.config">/my/path/to/tika.config</str>

 I have tried to google for an example of such a Tika config file, but
 haven't found anything.

 Erlend

 --
 Erlend Garåsen
 Center for Information Technology Services
 University of Oslo
 P.O. Box 1086 Blindern, N-0317 OSLO, Norway
 Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050



Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
this one.

http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the Solr parameter at the end: bin/nutch crawl urls -depth 5
-topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For
larger files I would also increase your JAVA_OPTS env to something
like JAVA_OPTS='-Xmx2048m'

Adam




On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, it seems like Nutch should solve most of my concerns.
 It would be great if you could share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

 I would just use Nutch and specify the -solr param on the command line.
 That will add the extracted content to your instance of Solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
  I need to index the documents present in my file system at various
  locations (e.g. C:\docs, D:\docs).
  Is there any way through which I can specify this in my DIH
  configuration?
  Here is my configuration:

  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
            baseDir="G:\Desktop\"
            recursive="false"
            rootEntity="true"
            transformer="DateFormatTransformer"
            onerror="continue">
      <entity name="tikatest"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.




Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
I take that back... I am currently using version 1.2; make sure
that the latest versions of Tika and PDFBox are in the contrib folder.
1.3 is structured a bit differently, and it doesn't look like there is
a contrib directory. Maybe one of the Nutch contributors can comment
on this?

Adam

On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada
estrada.adam.gro...@gmail.com wrote:
 There are a few tutorials out there.

 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
 3. Build the latest from branch
 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
 this one.

 http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

 but add the Solr parameter at the end: bin/nutch crawl urls -depth 5
 -topN 100 -solr http://localhost:8983/solr

 This will automatically add the data Nutch collected to Solr. For
 larger files I would also increase your JAVA_OPTS env to something
 like JAVA_OPTS='-Xmx2048m'

 Adam




 On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt panbh...@gmail.com wrote:
 Thanks Adam, it seems like Nutch should solve most of my concerns.
 It would be great if you could share resources for Nutch with us.

 / Pankaj Bhatt.

 On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

  I would just use Nutch and specify the -solr param on the command line.
  That will add the extracted content to your instance of Solr.

 Adam

 Sent from my iPhone

 On Jan 25, 2011, at 5:29 AM, pankaj bhatt panbh...@gmail.com wrote:

  Hi All,
  I need to index the documents present in my file system at various
  locations (e.g. C:\docs, D:\docs).
  Is there any way through which I can specify this in my DIH
  configuration?
  Here is my configuration:

  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
            baseDir="G:\Desktop\"
            recursive="false"
            rootEntity="true"
            transformer="DateFormatTransformer"
            onerror="continue">
      <entity name="tikatest"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="Author" name="author" meta="true"/>
        <field column="Content-Type" name="title" meta="true"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="all_text"/>
      </entity>

      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
    <!-- baseDir="../site" -->
  </document>
 
  / Pankaj Bhatt.





Re: Indexing spatial columns

2011-01-24 Thread Adam Estrada
Hi MapButcher,

There are a couple of things that are going on here.

1. The spatial functionality is confusing between versions of Solr. I wish
someone would update the Solr SpatialSearch wiki page.
2. You will want to use the jTDS driver here instead of the one from
Microsoft (http://jtds.sourceforge.net/); it works a little better.
3. For Solr 4.0 you will basically have to concatenate the lat/long fields
into a single column, which in the example schema is called store.
4. I don't know if individual columns actually exist for latitude and longitude
in 4.0, but in 1.4.x I know the lat/long fields HAD to be called lat and lng
and had to be of tdouble type, which I see below.
5. Or revert to Solr 1.4.x and try their plugin:
http://www.jteam.nl/news/spatialsolr.html
6. Try your queries in the Solr admin tool first, before trying to integrate
this into your code.

Overall, I have had great success with Solr spatial in just doing a simple
radius search. I am using the core 4.0 functionality and am having no problems.
I will eventually get into distance and bounding box queries, so whatever you
figure out and share would be great!
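
For reference, the relevant pieces of the stock example schema (from memory,
so double-check against your version) look roughly like this; the lat/long
pair is indexed into the location field as a single "lat,lon" string:

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<field name="store" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>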

Good luck,
Adam

On Jan 24, 2011, at 4:46 AM, mapbutcher wrote:

 
 Hi,
 
 I'm a bit of a Solr beginner. I have installed Solr 4.0, and I'm trying to
 index some spatial data stored in a SQL Server instance. I'm using the
 DataImportHandler; here is my data-config.xml:
 
 <dataConfig>
   <dataSource type="JdbcDataSource"
               driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
               url="jdbc:sqlserver://localhost\sqlserver08;databaseName=Spatial"
               user="sa" password="sqlserver08"/>
   <document>
     <entity name="poi" query="select OBJECTID,CATEGORY,NAME,POINT_X,POINT_Y
                               from NZ_POI">
       <field column="OBJECTID" name="id"/>
       <field column="CATEGORY" name="category"/>
       <field column="NAME" name="name"/>
       <field column="POINT_X" name="lat"/>
       <field column="POINT_Y" name="lon"/>
     </entity>
   </document>
 </dataConfig>
 
 In my schema file I have the following definitions:

   <field name="category" type="string" indexed="true" stored="true"/>
   <field name="name" type="string" indexed="true" stored="true"/>
   <field name="lat" type="tdouble" indexed="true" stored="true"/>
   <field name="lon" type="tdouble" indexed="true" stored="true"/>

   <copyField source="category" dest="text"/>
   <copyField source="name" dest="text"/>

 I have completed a data import with no errors in the log, as far as I can
 tell. However, when I inspect the schema I do not see the column names
 lat/lon. When sending the query:

 http://localhost:8080/Solr/select/?q=Camp AND _val_:recip(dist(2, lon, lat,
 44.794, -93.2696), 1, 1, 0)^100

 I get an error: undefined column.
 
 Does anybody have any ideas about whether the above is the correct procedure
 for indexing spatial data?
 
 Cheers
 
 S
 
 
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Indexing-spatial-columns-tp2318493p2318493.html
 Sent from the Solr - User mailing list archive at Nabble.com.



[Building] Solr4.0 on Windows

2011-01-23 Thread Adam Estrada
All,

I am having problems building Solr trunk on my Windows 7 machine. I
get the following errors...

BUILD FAILED
C:\Apache\Solr-Nightly\build.xml:23: The following error occurred while
executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:529:
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
The following error occurred while executing this line:
C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!

I am full admin on my machine and made sure that I was running the
build as admin, but it still fails. I just tried the same thing on the
Mac and ran it as sudo, and it built perfectly. Any ideas?

Thanks,
Adam


Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Adam Estrada
+1 on Nutch!

On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 Please take a look at Apache Nutch. It can crawl through a file system over
 FTP.
 After crawling, it can use Tika to extract the content from your PDF files and
 other. Finally you can then send the data to your Solr server for indexing.

 http://nutch.apache.org/

 Hi All,
   Is there any way in Solr, or any plug-in, through which the folders and
 documents in an FTP location can be indexed?

 / Pankaj Bhatt.



Re: [Building] Solr4.0 on Windows

2011-01-23 Thread Adam Estrada
So I did manage to get this to build...

"ant compile" does it.

Didn't it use straight Maven before? It's pretty hard to keep track of what's
what... Anyway, is there any way/reason all the cool Lucene jars aren't getting
copied into $SOLR_HOME/lib? That would really help and save a lot of time.
Where in the build script would I need to change this?

Thanks,
Adam

On Jan 23, 2011, at 9:31 PM, Adam Estrada wrote:

 All,
 
 I am having problems building Solr trunk on my windows 7 machine. I
 get the following errors...
 
 BUILD FAILED
 C:\Apache\Solr-Nightly\build.xml:23: The following error occurred while
 executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:529:
 The following error occurred while executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
 The following error occurred while executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
 The following error occurred while executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
 The following error occurred while executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
 The following error occurred while executing this line:
 C:\Apache\Solr-Nightly\lucene\common-build.xml:511: Tests failed!
 
 I am full admin on my machine and made sure that I was running the
 build as admin, but it still fails. I just tried the same thing on the
 Mac and ran it as sudo, and it built perfectly. Any ideas?
 
 Thanks,
 Adam



Re: Solr Out of Memory Error

2011-01-19 Thread Adam Estrada
Is anyone familiar with the environment variable JAVA_OPTS? I set
mine to a much larger heap size and never had any of these issues
again.

JAVA_OPTS = -server -Xms4048m -Xmx4048m

Adam

On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia isan.fu...@germinait.com wrote:
 Hi all,
 By adding more servers, do you mean sharding of the index? And after
 sharding, how will my query performance be affected?
 Will the query execution time increase?

 Thanks,
 Isan Fulia.

 On 19 January 2011 12:52, Grijesh pintu.grij...@gmail.com wrote:


 Hi Isan,

 It seems your index size (25GB) is much more than your total RAM
 size of 4GB.
 You have to do 2 things to avoid the Out of Memory problem:
 1. Buy more RAM; add at least 12 GB more.
 2. Increase the memory allocated to Solr by setting the -Xmx value;
 allocate at least 12 GB to Solr.

 But if your whole index fits into the cache memory, it will give you
 better results.

 Also add more servers to load-balance, as your QPS is high.
 Your 7 lakh documents making 25 GB of index looks quite high. Try to lower
 the index size.
 What are you indexing in your 25GB of index?

 -
 Thanx:
 Grijesh
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 Thanks  Regards,
 Isan Fulia.



Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
Is there a drastic difference between this and TagSoup which is already
included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
arnaud.gaudi...@gmail.com wrote:

 Hello,

 I would like to use BoilerPipe (a very good program which cleans HTML
 content of surplus clutter).
 I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from
 Solr; am I right?

 How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
 (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?

 Or do I need to modify some code inside Solr?

 I saw something like "TikaCLI -F" in the Tika forum (
 http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration);
 is it the right way?

 Thanks in advance,

 Arno.




Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-13 Thread Adam Estrada
Hi,

the following seems to work pretty well.

<fieldType name="text_ws" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="4" outputUnigrams="true"
            outputUnigramsIfNoShingles="false" />
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and
     matching of words on case-change, alpha numeric boundaries, and
     non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
     match a document containing "Wi-Fi".
     Synonyms and stopwords are customized by external files, and
     stemming is enabled.
     The attribute autoGeneratePhraseQueries="true" (the default) causes
     words that get split to form phrase queries. For example,
     WordDelimiterFilter splitting text:pdp-11 will cause the parser
     to generate text:"pdp 11" rather than (text:PDP OR text:11).
     NOTE: autoGeneratePhraseQueries="true" tends to not work well for
     non-whitespace-delimited languages.
-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="cat" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="cause" dest="text"/>
<copyField source="status" dest="text"/>
<copyField source="urgency" dest="text"/>

I ingest the source fields as text_ws (I know I've changed it a bit) and
then copy the field to text. This seems to do what you are asking for.
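
Two hedged notes on the exact-match requirements: as written, text_ws has no
LowerCaseFilterFactory, so matching is case-sensitive; adding one right after
the KeywordTokenizerFactory would cover requirement 1. The query side is then
just a quoted phrase against that field, e.g. (hypothetical field name):

http://localhost:8983/solr/select?q=keyword_exact:"new york city"

With KeywordTokenizerFactory the whole value is a single token, so partial or
single-word queries won't match, which covers requirement 2.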

Adam

On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

 Hi all,

 I've been stuck on exact keyword matching for several days. Hope you guys
 could help me. Here is the scenario:

   1. It needs to match multi-word keywords, case-insensitively.
   2. Partial-word or single-word matching on this field is not allowed.

 I want to know the field type definition for this field and a sample Solr
 query. I need to combine this search with my full-text search, which uses
 the dismax query.

 Thanks
 --
 Chhorn Chamnap
 http://chamnapchhorn.blogspot.com/



[sfield] Missing in Spatial Search

2011-01-13 Thread Adam Estrada
According to the documentation here:
http://wiki.apache.org/solr/SpatialSearch the field that identifies the
spatial point data is sfield. See the console output below.

Jan 13, 2011 6:49:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={spellcheck=true&f.jtype.facet.mincount=1&facet=true&f.cat.facet.mincount=1&f.cause.facet.mincount=1&f.urgency.facet.mincount=1&rows=10&start=0&q=*:*&f.status.facet.mincount=1&facet.field=cat&facet.field=jtype&facet.field=status&facet.field=cause&facet.field=urgency&fq={!type%3Dgeofilt+pt%3D39.0914154052734,-84.517822265625+sfield%3Dcoords+d%3D300}text:}
hits=113 status=0 QTime=1
Jan 13, 2011 6:51:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing sfield for spatial request

Any ideas on this one?

Thanks in advance,
Adam
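
For what it's worth, this error usually means the local params inside the fq
never parsed, so Solr saw no sfield; note the percent-encoded equals signs
(%3D) inside {!type%3Dgeofilt ...} in the log above. A hedged example of the
intended filter, with the coords field assumed from the surrounding threads:

http://localhost:8983/solr/select?q=*:*&fq={!geofilt pt=39.0914154052734,-84.517822265625 sfield=coords d=300}

The client should URL-encode the spaces inside {!...} as %20 (or +) while
leaving the = signs alone.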


Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
I believe this is what you are looking for. I renamed the field called
store to coords in the schema.xml file. The tricky part is building out
the query. I am using SolrNet to do this, though, and have not yet cracked the
problem.

http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500

Adam

On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.com wrote:


 Ok, this could be very easy to do, but I was not able to do it.
 I need to enable location search, i.e., if someone searches for the location
 'New York', show results for New York and results within 50 miles of New York.
 We do have latitude/longitude stored in the database for each record, but I'm
 not sure how to index these values to enable spatial search.
 Any help would be much appreciated.

 thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
Actually, by looking at the results from the geofilt filter it would
appear that it's not giving me the results I'm looking for. Or maybe it
is...I need to convert my results to KML to see if it is actually performing
a proper radius query.

http://localhost:8983/solr/select?q=*:*&fq={!geofilt%20pt=39.0914154052734,-84.517822265625%20sfield=coords%20d=5000}

http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&fq={!geofilt%20pt=32.15,-93.85%20sfield=coords%20d=5000}

Please let me know what you find.

Adam

On Wed, Jan 12, 2011 at 8:24 PM, Adam Estrada estrada.adam.gro...@gmail.com
 wrote:

 I believe this is what you are looking for. I renamed the field called
 store to coords in the schema.xml file. The tricky part is building out
 the query. I am using SolrNet to do this though and have not yet cracked the
 problem.


  http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500

 Adam

 On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.comwrote:


 Ok, this could be very easy to do, but I was not able to do it.
 I need to enable location search, i.e. if someone searches for the location 'New
 York' = show results for New York and results within 50 miles of New
 York.
 We do have latitude/longitude stored in the database for each record, but I'm not
 sure how to index these values to enable spatial search.
 Any help would be much appreciated.

 thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
In my case, I am getting data from a database and am able to concatenate the
lat/long as a coordinate pair to store in my coords field. To test this, I
randomized the lat/long values and generated about 6000 documents.
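If the concatenation is done in the DIH config rather than in SQL, the TemplateTransformer can assemble the pair (a sketch; the entity and column names are illustrative):

<entity name="loc" transformer="TemplateTransformer"
        query="select id, lat, lng from places">
  <field column="coords" template="${loc.lat},${loc.lng}"/>
</entity>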

Adam

On Wed, Jan 12, 2011 at 8:29 PM, caman aboxfortheotherst...@gmail.comwrote:


 Adam,

 thanks. Yes that helps
 but how does coords fields get populated? All I have is

 <field name="lat" type="tdouble" indexed="true" stored="true" />
 <field name="lng" type="tdouble" indexed="true" stored="true" />

 <field name="coord" type="location" indexed="true" stored="true" />

 fields 'lat' and 'lng' get populated by the dataimporthandler, but for coord
 I am not sure?

 Thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245709.html
 Sent from the Solr - User mailing list archive at Nabble.com.



[Example] Compound Queries

2011-01-11 Thread Adam Estrada
All,

I have the following query which works just fine for querying a date range.
Now I would like to add any kind of spatial query to the mix. Would someone
be so kind as to help me out with an example spatial query that works in
conjunction with my date range query?

http://localhost:8983/solr/select/?q=hurricane+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&version=2.2&start=0&rows=10&indent=on

I think it's something like this, but my results are not correct

http://localhost:8983/solr/select/?q=hurricane+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc&version=2.2&start=0&rows=10&indent=on

Your feedback is greatly appreciated!
Adam
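A sketch of one way to combine the two, assuming the point data lives in a location field named store (as in the stock schema): filter with geofilt and sort by geodist(), which reads the top-level sfield and pt params:

http://localhost:8983/solr/select/?q=hurricane+AND+eventdate:[2006-01-21T00:00:00Z+TO+2007-01-21T00:00:00Z]&fq={!geofilt%20pt=45.15,-93.85%20sfield=store%20d=50}&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc&version=2.2&start=0&rows=10&indent=on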


Re: DIH - Closing ResultSet in JdbcDataSource

2011-01-07 Thread Adam Estrada
This is my configuration which seems to work just fine.

<?xml version="1.0" encoding="utf-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              name="DBImport"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://localhost;databaseName=50_DEV;responseBuffering=adaptive;"
              user="test"
              password="test"
              onError="skip"/>
  <document>

From there it's just a matter of running the select statement and mapping it
against the correct fields in your index.

Adam

On Fri, Jan 7, 2011 at 2:40 PM, Shane Perry thry...@gmail.com wrote:

 Hi,

 I am in the process of migrating our system from Postgres 8.4 to Solr
 1.4.1.  Our system is fairly complex and as a result, I have had to define
 19 base entities in the data-config.xml definition file.  Each of these
 entities executes 5 queries.  When doing a full-import, as each entity
 completes, the server hosting Postgres shows 5 idle in transaction for
 the
 entity.

 In digging through the code, I found that the JdbcDataSource wraps the
 ResultSet object in a custom ResultSetIterator object, leaving the
 ResultSet
 open.  Walking through the code I can't find a close() call anywhere on the
 ResultSet.  I believe this results in the idle in transaction processes.

 Am I off base here?  I'm not sure what the overall implications are of the
 idle in transaction processes, but is there a way I can get around the
 issue without importing each entity manually?  Any feedback would be
 greatly
 appreciated.

 Thanks in advance,

 Shane
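One setting that may help here (a suggestion, not a confirmed fix; check the DataImportHandler wiki for the exact semantics in your version, and the driver/url values below are illustrative) is readOnly="true" on the JdbcDataSource, which switches the connection to autoCommit and can keep sessions from lingering idle in transaction:

<dataSource type="JdbcDataSource"
            driver="org.postgresql.Driver"
            url="jdbc:postgresql://localhost:5432/mydb"
            user="user" password="password"
            readOnly="true"/>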



Re: [sqljdbc4.jar] Errors

2011-01-05 Thread Adam Estrada
I can't tell any difference in performance but it does work like a charm. At
least the messaging in the console is a lot more verbose.

Thank you very much for the heads up on this one ;-)

Adam

On Wed, Jan 5, 2011 at 4:29 AM, Gora Mohanty g...@mimirtech.com wrote:

 On Wed, Jan 5, 2011 at 10:18 AM, Estrada Groups
 estrada.adam.gro...@gmail.com wrote:
  I downloaded that driver today and will test it tomorrow. Thanks for the
 tip! Would you mind sending an XML code snippet if it's any different to
 load than the MS driver?
 [...]

 I presume that you are referring to the jTDS driver. The options are
 slightly
 different. Here is a snippet from the XML configuration of our
 DataImportHandler,
 with sensitive details obscured.
 <dataSource type="JdbcDataSource" name="jdbc"
             driver="net.sourceforge.jtds.jdbc.Driver"
             url="jdbc:jtds:sqlserver://db_server:port;databasename=dbname;responseBuffering=adaptive"
             user="user" password="password" onError="skip" />

 The jtds FAQ ( http://jtds.sourceforge.net/faq.html ) also has other
 configuration
 options, and more helpful information. For us, the transition was
 pretty painless.

 Regards,
 Gora



[sqljdbc4.jar] Errors

2011-01-04 Thread Adam Estrada
Can anyone help me with the following error. I upgraded my database to SQL
Server 2008 SP2 and now I get the following error. It was working with SQL
Server 2005.

Solr Error Stack
Caused by: java.lang.UnsupportedOperationException: Java Runtime Environment
(JRE) version 1.6 is not supported by this driver. Use the sqljdbc4.jar class
library, which provides support for JDBC 4.0.

Any tips on this would be great!

Thanks,
Adam


Re: [sqljdbc4.jar] Errors

2011-01-04 Thread Adam Estrada
I got the latest jar file from the MS website and then changed the
authentication to Mixed Mode on my DB. That seems to have fixed it. My 2005
Server was Windows Authentication only and that worked so there are
obviously quite a few differences between the versions of the DB. I learn
something new every day

Thanks for the feedback!
Adam

On Tue, Jan 4, 2011 at 10:20 PM, Lance Norskog goks...@gmail.com wrote:

 Do you get a new JDBC driver jar with 2008? Look around the
 distribution or the MS web site.

 On Tue, Jan 4, 2011 at 7:06 PM, pankaj bhatt panbh...@gmail.com wrote:
  Hi Adam,
    Can you try downgrading your Java version to Java 5?
  However, I am using Java 6u13 with sqljdbc4.jar and I do not
  get any error.
  If possible, can you please also try with some other version of
  Java 6.
 
  / Pankaj Bhatt.
 
  On Wed, Jan 5, 2011 at 5:01 AM, Adam Estrada
  estrada.adam.gro...@gmail.comwrote:
 
  Can anyone help me with the following error. I upgraded my database to
 SQL
  Server 2008 SP2 and now I get the following error. It was working with
 SQL
  Server 2005.
 
  Solr Error Stack
  Caused by: java.lang.UnsupportedOperationException: Java Runtime
  Environment
  (JR
  E) version 1.6 is not supported by this driver. Use the sqljdbc4.jar
 class
  libra
  ry, which provides support for JDBC 4.0.
 
  Any tips on this would be great!
 
  Thanks,
  Adam
 
 



 --
 Lance Norskog
 goks...@gmail.com



Re: [Nutch] and Solr integration

2011-01-03 Thread Adam Estrada
All,

I realize that the documentation says that you crawl first then add to Solr
but I spent several hours running the same command through Cygwin with
-solrindex http://localhost:8983/solr on the command line (eg. bin/nutch
crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr) and it worked. Does anyone know why it's not
working for me anymore? I am using the Lucid build of Solr, which was what I
was using before. I neglected to write down the command line syntax, which is
biting me in the arse. Any tips on this one would be great!

Thanks,
Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex as an argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene]
 ml-node+2122347-622655030-146...@n3.nabble.com wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  ht
  tp://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 
  --
   View message @
 
 http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
  To start a new topic under Solr - User, email
  ml-node+472068-1941297125-146...@n3.nabble.comml-node%2b472068-1941297125-146...@n3.nabble.com
 ml-node%2b472068-1941297125-146...@n3.nabble.comml-node%252b472068-1941297125-146...@n3.nabble.com
 
  To unsubscribe from Solr - User, click here
 http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=472068code=YW51cmFnLml0LmpvbGx5QGdtYWlsLmNvbXw0NzIwNjh8LTIwOTgzNDQxOTY=
 .
 
 



 --
 Kumar Anurag


 -
 Kumar Anurag

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122623.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SpatialTierQueryParserPlugin Loading Error

2011-01-03 Thread Adam Estrada
No just yet, Grant...I have been sidetracked on a couple other things but I
will keep you posted.

Thanks for the response,
Adam

On Mon, Jan 3, 2011 at 10:22 AM, Grant Ingersoll gsing...@apache.orgwrote:

 Sorry, I just saw this, Adam.  Were you able to get it working?

 On Dec 28, 2010, at 8:54 PM, Adam Estrada wrote:

  Hi Grant,
 
  I grabbed the latest version from trunk this morning and am still unable
 to
  get any of the spatial functionality to work. I still seem to be getting
 the
  class loading errors that I was getting when using the patches and jar
 files
  I found all over the web. What I really need at this point is an example
 of
  solrconfig.xml and whatever else I need to include to make it work
 properly.
  I am using the Geonames DB with valid lat/longs in decimal degrees so I'm
  confident that the data are correct. I have tried several examples all
 with
  the same results.
 
  There are other patches like the following that show snippets of how to
  modify the solrconfig file but there is no definitive source...
 
 
 https://issues.apache.org/jira/secure/attachment/12452781/SOLR-2077.Quach.Mattmann.082210.patch.txt
 
  I would gladly update this page if I could just get it working.
  http://wiki.apache.org/solr/SpatialSearch
 
  w/r,
  Adam
 
 
  On Tue, Dec 14, 2010 at 9:04 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
  For this functionality, you are probably better off using trunk or
  branch_3x.  There are quite a few patches related to that particular one
  that you will need to apply in order to have it work correctly.
 
 
  On Dec 13, 2010, at 10:06 PM, Adam Estrada wrote:
 
  All,
 
  Can anyone shed some light on this error. I can't seem to get this
  class to load. I am using the distribution of Solr from Lucid
  Imagination and the Spatial Plugin from here
  https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
  apply a patch but the jar file is in there. What else can I do?
 
  org.apache.solr.common.SolrException: Error loading class
  'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
   at
 
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
   at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
   at
  org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
   at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
   at org.apache.solr.core.SolrCore.init(SolrCore.java:548)
   at
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
   at
  org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
   at
  org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
   at
 
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
   at
 org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
   at
 
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
   at
 
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
   at
  org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
   at
  org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
   at
 
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
   at
 
 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
   at
  org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
   at
 
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
   at
  org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
   at
 
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
   at org.mortbay.jetty.Server.doStart(Server.java:210)
   at
  org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
   at
 org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at org.mortbay.start.Main.invokeMain(Main.java:183)
   at org.mortbay.start.Main.start(Main.java:497)
   at org.mortbay.start.Main.main(Main.java:115)
  Caused by: java.lang.ClassNotFoundException:
  org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method

Re: [Nutch] and Solr integration

2011-01-03 Thread Adam Estrada
BLEH! facepalm This is entirely possible to do in a single step AS LONG AS
YOU GET THE SYNTAX CORRECT ;-)

http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr http://localhost:8983/solr

The correct param is -solr, NOT -solrindex.

Cheers,
Adam

On Mon, Jan 3, 2011 at 11:45 AM, Adam Estrada estrada.a...@gmail.comwrote:

 All,

 I realize that the documentation says that you crawl first then add to Solr
 but I spent several hours running the same command through Cygwin with
 -solrindex http://localhost:8983/solr on the command line (eg. bin/nutch
 crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
 http://localhost:8983/solr) and it worked. Does anyone know why it's not
 working for me anymore? I am using the Lucid build of Solr, which was what I
 was using before. I neglected to write down the command line syntax, which is
 biting me in the arse. Any tips on this one would be great!

 Thanks,
 Adam

 On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex as an argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene]
 ml-node+2122347-622655030-146...@n3.nabble.com wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine
 there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from
 Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  ht
  tp://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for
 scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at
 org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 
  --
   View message @
  http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
 
 



 --
 Kumar Anurag


 -
 Kumar Anurag

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122623.html
 Sent from the Solr - User mailing list archive at Nabble.com.





[DIH] and XML Namespaces

2010-12-29 Thread Adam Estrada
All,

I am indexing some RSS feeds that are bound to specific namespaces. See
below...

<dataConfig>
  <dataSource type="HttpDataSource"
              encoding="UTF-8"
              connectionTimeout="50"
              readTimeout="50"/>
  <document>
    <entity name="filedatasource"
            processor="FileListEntityProcessor"
            baseDir="C:/Apache/Solr-Nightly/solr/example/solr/conf/dataimporthandler"
            fileName="^.*xml$"
            recursive="true"
            rootEntity="false"
            dataSource="null">

      <entity name="CBP"
              pk="link"
              datasource="filedatasource"
              url="http://ws.geonames.org/rssToGeoRSS?geoRSS=simple&amp;feedUrl=http://www.cbp.gov/xp/cgov/admin/rss/?rssUrl=/home.xml"
              processor="XPathEntityProcessor"
              forEach="/rss/channel | /rss/channel/item"
              transformer="DateFormatTransformer,HTMLStripTransformer">

        <field column="source"       xpath="/rss/channel/title"       commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link"        commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/dc:creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <field column="pubdate"      xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="dcdate"       xpath="/rss/channel/item/dc:date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="store"        xpath="/rss/channel/item/georss:point" />
      </entity>

The process completely skips over any path with a colon in it.
i.e. /rss/channel/item/georss:point. Any ideas how to get around this using
the DIH?

Thanks to Chris Mattmann for the heads up on the geocoding services.

Adam


Re: [DIH] and XML Namespaces

2010-12-29 Thread Adam Estrada
Piece of cake!

http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example

"Our XPath support has its limitations (no wildcards, only full path, etc.) but we
have tried to make sure that common use-cases are covered and, since it's
based on a streaming parser, it is extremely fast and consumes a constant
amount of memory even for large XMLs. It does not support namespaces, but
it can handle XMLs with namespaces. When you provide the xpath, just drop
the namespace and give the rest (e.g. if the tag is 'dc:subject' the mapping
should just contain 'subject'). Easy, isn't it? And you didn't need to write
one line of code! Enjoy :)"
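Applied to the config quoted below, that means stripping only the namespace prefixes from the affected xpaths (a sketch of the three mappings that change):

<field column="creator" xpath="/rss/channel/item/creator" />
<field column="dcdate"  xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
<field column="store"   xpath="/rss/channel/item/point" />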

On Wed, Dec 29, 2010 at 12:05 PM, Adam Estrada 
estrada.adam.gro...@gmail.com wrote:

 All,

 I am indexing some RSS feeds that are bound to specific namespaces. See
 below...

 <dataConfig>
   <dataSource type="HttpDataSource"
               encoding="UTF-8"
               connectionTimeout="50"
               readTimeout="50"/>
   <document>
     <entity name="filedatasource"
             processor="FileListEntityProcessor"
             baseDir="C:/Apache/Solr-Nightly/solr/example/solr/conf/dataimporthandler"
             fileName="^.*xml$"
             recursive="true"
             rootEntity="false"
             dataSource="null">

       <entity name="CBP"
               pk="link"
               datasource="filedatasource"
               url="http://ws.geonames.org/rssToGeoRSS?geoRSS=simple&amp;feedUrl=http://www.cbp.gov/xp/cgov/admin/rss/?rssUrl=/home.xml"
               processor="XPathEntityProcessor"
               forEach="/rss/channel | /rss/channel/item"
               transformer="DateFormatTransformer,HTMLStripTransformer">

         <field column="source"       xpath="/rss/channel/title"       commonField="true" />
         <field column="source-link"  xpath="/rss/channel/link"        commonField="true" />
         <field column="subject"      xpath="/rss/channel/description" commonField="true" />
         <field column="title"        xpath="/rss/channel/item/title" />
         <field column="link"         xpath="/rss/channel/item/link" />
         <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
         <field column="creator"      xpath="/rss/channel/item/dc:creator" />
         <field column="item-subject" xpath="/rss/channel/item/subject" />
         <field column="author"       xpath="/rss/channel/item/author" />
         <field column="comments"     xpath="/rss/channel/item/comments" />
         <field column="pubdate"      xpath="/rss/channel/item/pubDate" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
         <field column="dcdate"       xpath="/rss/channel/item/dc:date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
         <field column="store"        xpath="/rss/channel/item/georss:point" />
       </entity>

 The process completely skips over any path with a colon in it.
 i.e. /rss/channel/item/georss:point. Any ideas how to get around this using
 the DIH?

 Thanks to Chris Mattmann for the heads up on the geocoding services.

 Adam



Re: SpatialTierQueryParserPlugin Loading Error

2010-12-28 Thread Adam Estrada
Hi Grant,

I grabbed the latest version from trunk this morning and am still unable to
get any of the spatial functionality to work. I still seem to be getting the
class loading errors that I was getting when using the patches and jar files
I found all over the web. What I really need at this point is an example of
solrconfig.xml and whatever else I need to include to make it work properly.
I am using the Geonames DB with valid lat/longs in decimal degrees so I'm
confident that the data are correct. I have tried several examples all with
the same results.

There are other patches like the following that show snippets of how to
modify the solrconfig file but there is no definitive source...

https://issues.apache.org/jira/secure/attachment/12452781/SOLR-2077.Quach.Mattmann.082210.patch.txt

I would gladly update this page if I could just get it working.
http://wiki.apache.org/solr/SpatialSearch

w/r,
Adam


On Tue, Dec 14, 2010 at 9:04 AM, Grant Ingersoll gsing...@apache.orgwrote:

 For this functionality, you are probably better off using trunk or
 branch_3x.  There are quite a few patches related to that particular one
 that you will need to apply in order to have it work correctly.


 On Dec 13, 2010, at 10:06 PM, Adam Estrada wrote:

  All,
 
  Can anyone shed some light on this error. I can't seem to get this
  class to load. I am using the distribution of Solr from Lucid
  Imagination and the Spatial Plugin from here
  https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
  apply a patch but the jar file is in there. What else can I do?
 
  org.apache.solr.common.SolrException: Error loading class
  'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
at
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at
 org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
at org.apache.solr.core.SolrCore.init(SolrCore.java:548)
at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at
 org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at
 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
  Caused by: java.lang.ClassNotFoundException:
  org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source

Re: [Import Timeout] using /dataimport

2010-12-24 Thread Adam Estrada
All,

That link is great, but I am still getting timeout issues, which cause the
entire import to fail. The feeds that are failing are widely used ones like
Newsweek and USA Today. It's strange because sometimes they work and
sometimes they don't. I think that there are still timeout issues, and
adding the params suggested in that article doesn't seem to fix it.

Adam

On Tue, Dec 21, 2010 at 8:04 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/12/22 9:35), Adam Estrada wrote:

 All,

 I've noticed that there are some RSS feeds that are slow to respond,
 especially during high usage times throughout the day. Is there a way to
 set
 the timeout to something really high or have it just wait until the feed
 is
 returned? The entire thing stops working when the feed doesn't respond.

 Your ideas are greatly appreciated.
 Adam

  readTimeout?

 http://wiki.apache.org/solr/DataImportHandler#Configuration_of_URLDataSource_or_HttpDataSource
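 For example, bumping both values on the feed datasource (a sketch; the values
 are illustrative and are in milliseconds):

 <dataSource type="HttpDataSource"
             encoding="UTF-8"
             connectionTimeout="10000"
             readTimeout="300000"/>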

 Koji
 --
 http://www.rondhuit.com/en/



Re: [Reload-Config] not working

2010-12-21 Thread Adam Estrada
I also noticed that when I run the reload-config command, the following
warning is thrown. I changed all my pk attributes to id to see if that changed
anything. Does anyone have any idea why this is not working for me?

INFO: id is a required field in SolrSchema . But not found in DataConfig.

Regards,
Adam

On Mon, Dec 20, 2010 at 10:58 AM, Adam Estrada estrada.a...@gmail.comwrote:

 This is the response I get...Does it matter that the configuration file is
 called something other than data-config.xml? After I get this I still have
 to restart the service. I wonder...do I need to commit the change?

 <?xml version="1.0" encoding="UTF-8" ?>
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">520</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">./solr/conf/dataimporthandler/rss.xml</str>
     </lst>
   </lst>
   <str name="command">reload-config</str>
   <str name="status">idle</str>
   <str name="importResponse">Configuration Re-loaded sucessfully</str>
   <lst name="statusMessages" />
   <str name="WARNING">This response format is experimental. It is
 likely to change in the future.</str>
 </response>



 On Sun, Dec 19, 2010 at 11:12 PM, Ahmet Arslan iori...@yahoo.com wrote:

  <a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import">Full Import</a><br />
  <a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config">Reload Configuration</a>
 
  All,
 
  The links above are meant for me to reload the
  configuration file after a
  change is made and the other is to perform the full import.
  My problem is
  that The reload-config option does not seem to be working.
  Am I doing
  anything wrong? Your expertise is greatly appreciated!

 I am sorry, I hit the reply button accidentally.

 Are you receiving/checking the message
 str name=importResponseConfiguration Re-loaded sucessfully/str
 after the reload?

 And are checking that data-config.xml is a valid xml after editing it
 programatically?

 And instead of editing data-config.xml file cant you use  variable
 resolver? http://search-lucene.com/m/qYzPk2n86iIsubj







[Import Timeout] using /dataimport

2010-12-21 Thread Adam Estrada
All,

I've noticed that there are some RSS feeds that are slow to respond,
especially during high usage times throughout the day. Is there a way to set
the timeout to something really high or have it just wait until the feed is
returned? The entire thing stops working when the feed doesn't respond.

Your ideas are greatly appreciated.
Adam


Re: [Reload-Config] not working

2010-12-20 Thread Adam Estrada
This is the response I get...Does it matter that the configuration file is
called something other than data-config.xml? After I get this I still have
to restart the service. I wonder...do I need to commit the change?

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">520</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">./solr/conf/dataimporthandler/rss.xml</str>
    </lst>
  </lst>
  <str name="command">reload-config</str>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages" />
  <str name="WARNING">This response format is experimental. It is likely
to change in the future.</str>
</response>



On Sun, Dec 19, 2010 at 11:12 PM, Ahmet Arslan iori...@yahoo.com wrote:

  <a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import">Full Import</a><br />
  <a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config">Reload Configuration</a>
 
  All,
 
  The links above are meant for me to reload the
  configuration file after a
  change is made and the other is to perform the full import.
  My problem is
  that The reload-config option does not seem to be working.
  Am I doing
  anything wrong? Your expertise is greatly appreciated!

 I am sorry, I hit the reply button accidentally.

 Are you receiving/checking the message
 str name=importResponseConfiguration Re-loaded sucessfully/str
 after the reload?

 And are checking that data-config.xml is a valid xml after editing it
 programatically?

 And instead of editing data-config.xml file cant you use  variable
 resolver? http://search-lucene.com/m/qYzPk2n86iIsubj






[Nutch] and Solr integration

2010-12-20 Thread Adam Estrada
All,

I have a couple websites that I need to crawl and the following command line
used to work I think. Solr is up and running and everything is fine there
and I can go through and index the site but I really need the results added
to Solr after the crawl. Does anyone have any idea on how to make that
happen or what I'm doing wrong?  These errors are being thrown from Hadoop,
which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread main java.io.IOException: No FileSystem for scheme:
http
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
ava:169)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
va:201)
at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)

at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
81)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)


Re: [Nutch] and Solr integration

2010-12-20 Thread Adam Estrada
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex
http://localhost:8983/solr

I've run that command before and it worked...that's why I asked.

grab nutch from trunk and run bin/nutch and see that it is in fact an
option. It looks like Hadoop is the culprit now and I am at a loss on how to
fix it.

Thanks for the feedback.
Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:


 Why are you using solrindex as an argument? It is used when we need to index
 the crawled data in Solr.
 For more read http://wiki.apache.org/nutch/NutchTutorial .

 Also for nutch-solr integration this is very useful blog
 http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 I integrated nutch and solr and it works well.

 Thanks

 On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene]
 ml-node+2122347-622655030-146...@n3.nabble.com wrote:

  All,
 
  I have a couple websites that I need to crawl and the following command
  line
  used to work I think. Solr is up and running and everything is fine there
  and I can go through and index the site but I really need the results
 added
 
  to Solr after the crawl. Does anyone have any idea on how to make that
  happen or what I'm doing wrong?  These errors are being thrown from Hadoop
  which I am not using at all.
 
  $ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50
  -solrindex
  ht
  tp://localhost:8983/solr
  crawl started in: crawl
  rootUrlDir = http://localhost:8983/solr
  threads = 10
  depth = 100
  indexer=lucene
  topN = 50
  Injector: starting at 2010-12-20 15:23:25
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: http://localhost:8983/solr
  Injector: Converting injected urls to crawl db entries.
  Exception in thread main java.io.IOException: No FileSystem for scheme:
  http
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375
  )
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
  at
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j
  ava:169)
  at
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja
  va:201)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
  81)
  at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
 
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 
 
  --
   View message @
  http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122347.html
 
 



 --
 Kumar Anurag


 -
 Kumar Anurag

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Nutch-and-Solr-integration-tp2122347p2122623.html
 Sent from the Solr - User mailing list archive at Nabble.com.



[Reload-Config] not working

2010-12-19 Thread Adam Estrada
<a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=full-import">Full Import</a><br />
<a href="http://localhost:8983/solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config">Reload Configuration</a>

All,

The links above are meant for me to reload the configuration file after a
change is made and the other is to perform the full import. My problem is
that The reload-config option does not seem to be working. Am I doing
anything wrong? Your expertise is greatly appreciated!

Adam


Re: indexing a lot of XML dokuments

2010-12-16 Thread Adam Estrada
I have been very successful in following this example:
http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example

Adam
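For XML files that already sit on disk (rather than being fetched over HTTP), the usual pattern is a FileListEntityProcessor wrapping an XPathEntityProcessor — a sketch, with the paths, forEach expression, and field names purely illustrative:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/xml" fileName=".*\.xml$" recursive="true"
            rootEntity="false" dataSource="null">
      <entity name="doc" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/record">
        <field column="id"    xpath="/record/id" />
        <field column="title" xpath="/record/title" />
      </entity>
    </entity>
  </document>
</dataConfig>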

On Thu, Dec 16, 2010 at 5:44 AM, Jörg Agatz joerg.ag...@googlemail.comwrote:

 Hi users, I'm searching for a way to index a lot of XML documents as fast as
 possible.

 I have more than 1 million docs on server 1 and a Solr multicore on server 2
 with Tomcat.

 I don't know how I can do it easily and quickly.

 I can't find an idea in the wiki; maybe you have some ideas?

 King



Re: bulk commits

2010-12-16 Thread Adam Estrada
what is it that you are trying to commit?

a

On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.netwrote:

 What have people found as the best way to do bulk commits either from the
 web or
 from a file on the system?

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others’ mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: bulk commits

2010-12-16 Thread Adam Estrada
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xam.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xan.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xao.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xap.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<optimize/>'

Adam

On Thu, Dec 16, 2010 at 1:44 PM, Dennis Gearon gear...@sbcglobal.netwrote:

 Might be Csv or tab delimited text.

 Sent from Yahoo! Mail on Android

  --
 * From: * Adam Estrada estrada.adam.gro...@gmail.com;
 * To: * solr-user@lucene.apache.org;
 * Subject: * Re: bulk commits
 * Sent: * Thu, Dec 16, 2010 6:35:17 PM

   what is it that you are trying to commit?

 a

 On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.net
 wrote:

  What have people found as the best way to do bulk commits either from the
  web or
  from a file on the system?
 
   Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
  better
  idea to learn from others’ mistakes, so you do not have to make them
  yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 



Re: bulk commits

2010-12-16 Thread Adam Estrada
One very important thing I forgot to mention is that you will have to
increase the JAVA heap size for larger data sets.

Set the JVM options (e.g. JAVA_OPTS for Tomcat, or the -Xms/-Xmx flags directly) to something acceptable.
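For the bundled Jetty example that can be as simple as (a sketch; the sizes depend on your data):

java -Xms512m -Xmx2048m -jar start.jar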

Adam

On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net
 wrote:
  That easy, huh? Heck, this gets better and better.
 
  BTW, how about escaping?

 The CSV escaping?  It's configurable to allow for loading different
 CSV dialects.

 http://wiki.apache.org/solr/UpdateCSV

 By default it uses double quote encapsulation, like excel would.
 The bottom of the wiki page shows how to configure tab separators and
 backslash escaping like MySQL produces by default.
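 For instance, a tab-separated, backslash-escaped dump (the MySQL default
 dialect) could be loaded with something like this sketch, where %09 is the
 URL-encoded tab and %5C the backslash, as on the UpdateCSV wiki page:

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5C&stream.file=/tmp/data.tsv"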

 -Yonik
 http://www.lucidimagination.com


 
   Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
 better
  idea to learn from others’ mistakes, so you do not have to make them
 yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 
 
  - Original Message 
  From: Adam Estrada estrada.adam.gro...@gmail.com
  To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org
  Sent: Thu, December 16, 2010 10:58:47 AM
  Subject: Re: bulk commits
 
 This is how I import a lot of data from a csv file. There are close to 100k
 records in there. Note that you can either pre-define the column names using
 the fieldnames param like I did here *or* include header=true, which will
 automatically pick up the column header if your file has it.

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

 This seems to load everything into some kind of temporary location before
 it's actually committed. If something goes wrong there is a rollback feature
 that will undo anything that happened before the commit.

 As far as batching a bunch of files, I copied and pasted the following into
 Cygwin and it worked just fine.

 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
 curl
 http://localhost:8983/solr

Re: [DIH] Example for SQL Server

2010-12-15 Thread Adam Estrada
Thanks All,

Testing here shortly and will report back asap.

w/r,
Adam

On Wed, Dec 15, 2010 at 4:10 AM, Savvas-Andreas Moysidis 
savvas.andreas.moysi...@googlemail.com wrote:

 Hi Adam,

 we are using DIH to index off an SQL Server database (the freebie SQLExpress
 one.. ;) ). We have defined the following in our
 %TOMCAT_HOME%\solr\conf\data-config.xml:
 <dataConfig>
   <dataSource type="JdbcDataSource"
               name="mssqlDatasource"
               driver="net.sourceforge.jtds.jdbc.Driver"
               url="jdbc:jtds:sqlserver://{server.name}:{server.port}/{dbInstanceName};instance=SQLEXPRESS"
               convertType="true"
               user="{user.name}"
               password="{user.password}"/>
   <document>
     <entity name="id"
             dataSource="mssqlDatasource"
             query="your query here" />
   </document>
 </dataConfig>

 We downloaded a JDBC driver from here http://jtds.sourceforge.net/faq.html and
 found it to be a quite stable driver.

 And the only thing we really had to do was drop that library in
 %TOMCAT_HOME%\lib directory (for Tomcat 6+).

 Hope that helps.
 -- Savvas.

 On 14 December 2010 22:46, Erick Erickson erickerick...@gmail.com wrote:

  The config isn't really any different for various sql instances, about
 the
  only difference is the driver. Have you seen the example in the
  distribution somewhere like
  solr_home/example/example-DIH/solr/db/conf/db-data-config.xml?
 
  Also, there's a magic URL for debugging DIH at:
  .../solr/admin/dataimport.jsp
 
  If none of that is useful, could you post your attempt and maybe someone
  can
  offer some hints?
 
  Best
  Erick
 
  On Tue, Dec 14, 2010 at 5:32 PM, Adam Estrada 
  estrada.adam.gro...@gmail.com
   wrote:
 
   Does anyone have an example config.xml file I can take a look at for
 SQL
   Server? I need to index a lot of data from a DB and can't seem to
 figure
   out
   the right syntax so any help would be greatly appreciated. What is the
   correct /jar file to use and where do I put it in order for it to work?
  
   Thanks,
   Adam
  
 



Re: Dataimport performance

2010-12-15 Thread Adam Estrada
What version of Solr are you using?

Adam

2010/12/15 Robert Gründler rob...@dubture.com

 Hi,

 we're looking for some comparison-benchmarks for importing large tables
 from a mysql database (full import).

 Currently, a full-import of ~ 8 Million rows from a MySQL database takes
 around 3 hours, on a QuadCore Machine with 16 GB of
 ram and a Raid 10 storage setup. Solr is running on a apache tomcat
 instance, where it is the only app. The tomcat instance
 has the following memory-related java_opts:

 -Xms4096M -Xmx5120M


 The data-config.xml looks like this (only 1 entity):

 <entity name="track" query="select t.id as id, t.title as title,
   l.title as label from track t left join label l on (l.id = t.label_id)
   where t.deleted = 0" transformer="TemplateTransformer">
   <field column="title" name="title_t" />
   <field column="label" name="label_t" />
   <field column="id" name="sf_meta_id" />
   <field column="metaclass" template="Track" name="sf_meta_class"/>
   <field column="metaid" template="${track.id}" name="sf_meta_id"/>
   <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>

   <entity name="artists" query="select a.name as artist from artist a
     left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${track.id}">
     <field column="artist" name="artists_t" />
   </entity>

 </entity>


 We have the feeling that 3 hours for this import is quite long, given
 the performance of the server running solr/mysql.

 Are we wrong with that assumption, or do people experience similar import
 times with this amount of data to be imported?


 thanks!


 -robert
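Two knobs that often help with this shape of import (suggestions to try, not a confirmed diagnosis): for MySQL, batchSize="-1" on the JdbcDataSource makes the driver stream rows instead of buffering the whole result set, and the artists sub-entity as written fires one query per track — CachedSqlEntityProcessor can replace that with a single up-front query. A sketch (connection details illustrative):

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/yourdb" batchSize="-1"
            user="user" password="password"/>

<entity name="artists" processor="CachedSqlEntityProcessor"
        query="select ta.track_id as track_id, a.name as artist
               from artist a left join track_artist ta on (ta.artist_id = a.id)"
        where="track_id=track.id">
  <field column="artist" name="artists_t" />
</entity>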






[Adding] Entities when indexing a DB

2010-12-15 Thread Adam Estrada
All,

I have successfully indexed a single entity, but when I try multiple entities
the second is skipped altogether. Is there something wrong with my
config file?

<?xml version="1.0" encoding="utf-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV"
              user="adam"
              password="password"/>
  <document name="events">
    <entity datasource="MISSIONS"
            query="SELECT IdMission AS id,
                          CoreGroup AS cat,
                          StrMissionname AS subject,
                          strDescription AS description,
                          DateCreated AS pubdate
                   FROM dbo.tblMission">
      <field column="id" name="id" />
      <field column="cat" name="cat" />
      <field column="subject" name="subject" />
      <field column="description" name="description" />
      <field column="pubdate" name="date" />
    </entity>
    <entity datasource="EVENTS"
            query="SELECT strsubject AS subject,
                          strsummary as description,
                          datecreated as date,
                          CoreGroup as cat,
                          idevent as id
                   FROM dbo.tblEvent">
      <field column="id" name="id" />
      <field column="cat" name="cat" />
      <field column="subject" name="subject" />
      <field column="description" name="description" />
      <field column="pubdate" name="date" />
    </entity>
  </document>
</dataConfig>


Re: [Adding] Entities when indexing a DB

2010-12-15 Thread Adam Estrada
Ahhh...I found that I did not set a dataSource name; when I did that and
then referred each entity to that dataSource, all went according to plan ;-)

?xml version=1.0 encoding=utf-8 ?
dataConfig
  dataSource type=JdbcDataSource
   name=bleh
   driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
   url=jdbc:sqlserver://server;databaseName=50_DEV
   user=adam
   password=pw/
  document
entity name=Missions dataSource=bleh
query = SELECT (IdMission + 100) AS id,
idMission as missionid,
CoreGroup AS cat,
StrMissionname AS subject,
strDescription AS description,
DateCreated AS pubdate,
'Mission' AS cat2
FROM dbo.tblMission
  field column=id name=id /
  field column=missionid name=missionid /
  field column=cat name=cat /
  field column=cat2 name=cat2 /
  field column=subject name=subject /
  field column=description name=description /
  field column=pubdate name=date /
/entity

entity name=Events dataSource=bleh
 query = SELECT strsubject AS subject,
strsummary as description,
datecreated as date,
CoreGroup as cat,
idevent as id,
'Event' AS cat2,
IdEvent AS missionid
FROM dbo.tblEvent
  field column=id name=id /
  field column=missionid name=missionid /
  field column=cat name=cat /
  field column=cat2 name=cat2 /
  field column=subject name=subject /
  field column=description name=description /
  field column=pubdate name=date /
/entity
  /document
/dataConfig

Solr Rocks!
Adam



On Wed, Dec 15, 2010 at 3:53 PM, Allistair Crossley a...@roxxor.co.uk wrote:

 mission.id and event.id, if they share the same value, will overwrite the
 indexed document. Your ids need to be unique across all documents. I usually
 have a field id_original that I map the table id to, and then for the id of
 each entity I usually prefix it with the entity name in the value mapped to
 the schema id field.
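
 For example, hedged and untested, the prefixing can be done directly in the
 entity query (table and column names borrowed from the config above):

   SELECT 'mission_' + CAST(IdMission AS varchar(20)) AS id,
          IdMission AS id_original,
          CoreGroup AS cat,
          StrMissionname AS subject
   FROM dbo.tblMission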

 On 15 Dec 2010, at 20:49, Adam Estrada wrote:

  All,
 
  I have successfully indexed a single entity but when I try multiple
 entities
   the second is skipped altogether. Is there something wrong with my
  config file?
 
  ?xml version=1.0 encoding=utf-8 ?
  dataConfig
   dataSource type=JdbcDataSource
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=jdbc:sqlserver://10.0.2.93;databaseName=50_DEV
user=adam
password=password/
   document name=events
 entity datasource=MISSIONS
 query = SELECT IdMission AS id,
 CoreGroup AS cat,
 StrMissionname AS subject,
 strDescription AS description,
 DateCreated AS pubdate
 FROM dbo.tblMission
   field column=id name=id /
   field column=cat name=cat /
   field column=subject name=subject /
   field column=description name=description /
   field column=pubdate name=date /
 /entity
 entity datasource=EVENTS
  query = SELECT strsubject AS subject,
 strsummary as description,
 datecreated as date,
 CoreGroup as cat,
 idevent as id
 FROM dbo.tblEvent
   field column=id name=id /
   field column=cat name=cat /
   field column=subject name=subject /
   field column=description name=description /
   field column=pubdate name=date /
 /entity
   /document
  /dataConfig




Thank you!

2010-12-15 Thread Adam Estrada
I just want to say that this listserv has been invaluable to a newbie like
me ;-) I posted a question earlier today and literally 10 minutes later I
got an answer that helped me solve my problem. This is proof that there is an
experienced and energetic community behind this FOSS group of projects, and I
really appreciate everyone who has put up with my otherwise trivial
questions!  More importantly, thanks to all of the contributors who make the
whole thing possible!  I attended the Lucene Revolution conference in Boston
this year and the information that I was able to take away from the whole
thing has made me and my vocation a lot more valuable. Keep up the
outstanding work in the discovery of useful information from a sea of bleh
;-)

Kindest regards,
Adam


[DIH] Example for SQL Server

2010-12-14 Thread Adam Estrada
Does anyone have an example config.xml file I can take a look at for SQL
Server? I need to index a lot of data from a DB and can't seem to figure out
the right syntax so any help would be greatly appreciated. What is the
correct jar file to use and where do I put it in order for it to work?

Thanks,
Adam


Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
+1  If I knew enough about how to do this in Java I would, but I do not.
What is the correct way to add or suggest enhancements to Solr
core?

Adam

On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog goks...@gmail.com wrote:

 Nice find!  This is Apache 2.0, copyright SUN.

 O Great Apache Elders: Is it kosher to add this to the Solr
 distribution? It's not in the JDK and is also com.sun.*

 On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:
  Thanks for the feedback! There are quite a few formats that can be used.
 I
  am experiencing at least 5 of them. Would something like this work? Note
  that there are 2 different formats separated by a comma.
 
  field column=pubdate xpath=/rss/channel/item/pubDate
  dateTimeFormat=EEE, dd MMM yyyy HH:mm:ss zzz, yyyy-MM-dd'T'HH:mm:ss'Z'
 /
 
  I don't suppose it will because there is already a comma in the first
  parser. I guess I am really looking for an all-purpose date-time parser
 but
  even if I have that, would I still be able to query *all* fields in the
  index?
 
  Good article:
 
 http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
 
  Adam
 
  On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp
 wrote:
 
  (10/12/13 8:49), Adam Estrada wrote:
 
  All,
 
  I am having some difficulties parsing the pubDate field that is part
 of
   the
  RSS spec (I believe). I get the warning that states, Dec 12, 2010
  6:45:26
  PM org.apache.solr.handler.dataimport.DateFormatTransformer
   transformRow
  WARNING: Could not parse a Date field
  java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43
   +0000
  at java.text.DateFormat.parse(Unknown Source)
 
  Does anyone know how to fix this? I would eventually like to do a date
  query
  but without the ability to properly parse them I don't know if it's
 going
  to
  work.
 
  Thanks,
  Adam
 
 
  Adam,
 
  How does your data-config.xml look like for that field?
  Have you looked at rss-data-config.xml file
  under example/example-DIH/solr/rss/conf directory?
 
  Koji
  --
  http://www.rondhuit.com/en/
 
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Indexing pdf files - question.

2010-12-13 Thread Adam Estrada
Hi,

I use the following command to post PDF files.

$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\document.docx&stream.contentType=application/msword&literal.id=esc.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\features.pdf&stream.contentType=application/pdf&literal.id=esc2.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\Memo_ocrd.pdf&stream.contentType=application/pdf&literal.id=Memo_ocrd.pdf&defaultField=text&commit=true"

The PDF's have to be OCR'd.

Adam

On Mon, Dec 13, 2010 at 11:01 AM, Siebor, Wlodek [USA] 
siebor_wlo...@bah.com wrote:

 HI,
 Can somebody please send me a command for indexing a sample pdf with the
 ExtractingRequestHandler, on a file available in the /docs directory. I have
 lucidworks solr installed on linux, with standard schema.xml and
 solrconfig.xml files (unchanged). I want to pass as the unique id the name
 of the file.
 I’m trying various curl commands and so far I have either  “… missing
 required field: id” or “.. missing content stream” errors.
 Thanks for your help,
 Wlodek



Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
My first submission ;-)

https://issues.apache.org/jira/browse/SOLR-2286

Adam

On Mon, Dec 13, 2010 at 5:14 PM, Lance Norskog goks...@gmail.com wrote:

 Create an account at
 https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create
 New Issue' for the Solr project.

 On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog goks...@gmail.com wrote:
  Please file a JIRA requesting this.
 
  On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada estrada.a...@gmail.com
 wrote:
   +1  If I knew enough about how to do this in Java I would, but I do not.
   What is the correct way to add or suggest enhancements to Solr
  core?
 
  Adam
 
  On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog goks...@gmail.com
 wrote:
 
  Nice find!  This is Apache 2.0, copyright SUN.
 
  O Great Apache Elders: Is it kosher to add this to the Solr
  distribution? It's not in the JDK and is also com.sun.*
 
  On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
  estrada.adam.gro...@gmail.com wrote:
   Thanks for the feedback! There are quite a few formats that can be
 used.
  I
   am experiencing at least 5 of them. Would something like this work?
 Note
   that there are 2 different formats separated by a comma.
  
   field column=pubdate xpath=/rss/channel/item/pubDate
    dateTimeFormat=EEE, dd MMM yyyy HH:mm:ss zzz,
  yyyy-MM-dd'T'HH:mm:ss'Z'
  /
  
   I don't suppose it will because there is already a comma in the first
    parser. I guess I am really looking for an all-purpose date-time
 parser
  but
   even if I have that, would I still be able to query *all* fields in
 the
   index?
  
   Good article:
  
 
 http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
  
   Adam
  
   On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp
  wrote:
  
   (10/12/13 8:49), Adam Estrada wrote:
  
   All,
  
   I am having some difficulties parsing the pubDate field that is
 part
  of
    the
   RSS spec (I believe). I get the warning that states, Dec 12, 2010
   6:45:26
   PM org.apache.solr.handler.dataimport.DateFormatTransformer
transformRow
   WARNING: Could not parse a Date field
   java.text.ParseException: Unparseable date: Thu, 30 Jul 2009
 14:41:43
    +0000
   at java.text.DateFormat.parse(Unknown Source)
  
   Does anyone know how to fix this? I would eventually like to do a
 date
   query
   but without the ability to properly parse them I don't know if it's
  going
   to
   work.
  
   Thanks,
   Adam
  
  
   Adam,
  
   How does your data-config.xml look like for that field?
   Have you looked at rss-data-config.xml file
   under example/example-DIH/solr/rss/conf directory?
  
   Koji
   --
   http://www.rondhuit.com/en/
  
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 



 --
 Lance Norskog
 goks...@gmail.com



SpatialTierQueryParserPlugin Loading Error

2010-12-13 Thread Adam Estrada
All,

Can anyone shed some light on this error. I can't seem to get this
class to load. I am using the distribution of Solr from Lucid
Imagination and the Spatial Plugin from here
https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
apply a patch but the jar file is in there. What else can I do?

org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
at org.apache.solr.core.SolrCore.init(SolrCore.java:548)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
... 33 more
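
If the class really is inside the jar, the usual cause of this
ClassNotFoundException is that the jar never made it onto Solr's plugin
classpath. A hedged sketch of the two pieces that normally go in
solrconfig.xml (the lib dir and parser name below are placeholders):

  <lib dir="./lib" />

  <queryParser name="spatialTier"
               class="org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin" />

Dropping the jar into the core's lib directory under solr.home should work
as well.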


Re: SOLR geospatial

2010-12-12 Thread Adam Estrada
I am particularly interested in storing and querying polygons. That sort of
thing looks like its on their roadmap so does anyone know what the status is
on that? Also, integration with JTS would make this a core component of any
GIS. Again, anyone know what the status is on that?

*What’s on the roadmap of future features?*

Here are some of the features and enhancements we're planning for SSP:

   - Performance improvements for larger data sets
   - Fixing of known bugs
   - Distance facets: allowing Solr users to filter their results based on
     the calculated distances
   - Search with regular polygons, and groups of shapes
   - Integration with JTS
   - Highly optimized distance calculation algorithms
   - Ranking results by distance
   - 3D dimension search


Adam

On Sun, Dec 12, 2010 at 12:01 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 That smells like: http://www.jteam.nl/news/spatialsolr.html

  My partner is using a publicly available plugin for GeoSpatial. It is
 used
  both during indexing and during search. It forms some kind of gridding
  system and puts 10 fields per row related to that. Doing a Radius search
  (vs a bounding box search which is faster in almost all cases in all
  GeoSpatial query systems) seems pretty fast. GeoSpatial was our project's
  constraint. We've moved past that now.
 
  Did I mention that it returns distance from the center of the radius
 based
  on units supplied in the query?
 
  I would tell you what the plugin is, but in our division of labor, I have
  kept that out of my short term memory. You can contact him at:
  Danilo Unite danilo.un...@gmail.com;
 
  Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
  better idea to learn from others’ mistakes, so you do not have to make
  them yourself. from
  'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 
 
  - Original Message 
  From: George Anthony pa...@rogers.com
  To: solr-user@lucene.apache.org
  Sent: Fri, December 10, 2010 9:23:18 AM
  Subject: SOLR geospatial
 
  In looking at some of the docs' support for geospatial search.
 
  I see this functionality is mostly scheduled for upcoming release 4.0
 (with
  some
 
  playing around with backported code).
 
 
  I note the support for the bounding box filter, but will bounding box
 be
  one of the supported *data* types for use with this filter?  For example,
  if my lat/long data describes the footprint of a map, I'm curious if
  that type of coordinate data can be used by the bounding box filter (or
 in
  any other way for similar limiting/filtering capability). I see it can
  work with point type data but curious about functionality with bounding
  box type data (in contrast to simple point lat/long data).
 
  Thanks,
  George



Re: SOLR geospatial

2010-12-12 Thread Adam Estrada
I would be more than happy to help with any of the spatial testing you are
working on.

adam

On Sun, Dec 12, 2010 at 3:08 PM, Dennis Gearon gear...@sbcglobal.net wrote:

 We're in Alpha, heading to Alpha 2. Our requirements are simple: radius
 searching, and distance from center. Solr Spatial works and is current.
 GeoSpatial is almost there, but we're going to wait until it's released to
 spend
 time with it. We have other tasks to work on and don't want to be part of
 the
 debugging process of any project right now.

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others’ mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.



 - Original Message 
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Sun, December 12, 2010 11:18:03 AM
 Subject: Re: SOLR geospatial

 By and large, spatial solr is being replaced by geospatial, see:
 http://wiki.apache.org/solr/SpatialSearch. I don't think the old
 spatial contrib is still included in the trunk or 3.x code bases, but
 I could be wrong

 That said, I don't know whether what you want is on the roadmap
 there either. Here's a place to start if you want to see the JIRA
 discussions: https://issues.apache.org/jira/browse/SOLR-1568

 Best
 Erick


 On Sun, Dec 12, 2010 at 11:23 AM, Adam Estrada estrada.a...@gmail.com
 wrote:

  I am particularly interested in storing and querying polygons. That sort
 of
  thing looks like its on their roadmap so does anyone know what the status
  is
  on that? Also, integration with JTS would make this a core component of
 any
  GIS. Again, anyone know what the status is on that?
 
  *What’s on the roadmap of future features?*
 
   Here are some of the features and enhancements we're planning for SSP:

     - Performance improvements for larger data sets
     - Fixing of known bugs
     - Distance facets: allowing Solr users to filter their results based on
       the calculated distances
     - Search with regular polygons, and groups of shapes
     - Integration with JTS
     - Highly optimized distance calculation algorithms
     - Ranking results by distance
     - 3D dimension search
 
 
  Adam
 
  On Sun, Dec 12, 2010 at 12:01 AM, Markus Jelsma
   markus.jel...@openindex.io wrote:
 
   That smells like: http://www.jteam.nl/news/spatialsolr.html
  
My partner is using a publicly available plugin for GeoSpatial. It is
   used
both during indexing and during search. It forms some kind of
 gridding
system and puts 10 fields per row related to that. Doing a Radius
  search
(vs a bounding box search which is faster in almost all cases in all
GeoSpatial query systems) seems pretty fast. GeoSpatial was our
  project's
constraint. We've moved past that now.
   
Did I mention that it returns distance from the center of the radius
   based
on units supplied in the query?
   
I would tell you what the plugin is, but in our division of labor, I
  have
kept that out of my short term memory. You can contact him at:
Danilo Unite danilo.un...@gmail.com;
   
Dennis Gearon
   
   
Signature Warning

It is always a good idea to learn from your own mistakes. It is
 usually
  a
better idea to learn from others’ mistakes, so you do not have to
 make
them yourself. from
'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
   
   
EARTH has a Right To Life,
otherwise we all die.
   
   
   
- Original Message 
From: George Anthony pa...@rogers.com
To: solr-user@lucene.apache.org
Sent: Fri, December 10, 2010 9:23:18 AM
Subject: SOLR geospatial
   
In looking at some of the docs support for geospatial search.
   
I see this functionality is mostly scheduled for upcoming release 4.0
   (with
some
   
playing around with backported code).
   
   
I note the support for the bounding box filter, but will bounding
 box
   be
one of the supported *data* types for use with this filter?  For
  example,
if my lat/long data describes the footprint of a map, I'm curious
 if
that type of coordinate data can be used by the bounding box filter
 (or
   in
any other way for similar limiting/filtering capability). I see it
 can
work with point type data but curious about functionality with
 bounding
box type data (in contrast to simple point lat/long data).
   
Thanks,
George
  
 




Re: [Multiple] RSS Feeds at a time...

2010-12-12 Thread Adam Estrada
Hi Ahmet,

This is a great idea but still does not appear to be working correctly. The
idea is that I want to be able to add an RSS feed and then index that feed
on a schedule. My C# method looks something like this.

public ActionResult Index()
{
try {
HTTPGet req = new HTTPGet();
string solrStr =
System.Configuration.ConfigurationManager.AppSettings["solrUrl"].ToString();
req.Request(solrStr +
"/select?clean=true&commit=true&qt=/dataimport&command=reload-config");
req.Request(solrStr +
"/select?clean=false&commit=true&qt=/dataimport&command=full-import");
Response.Write(req.StatusLine);
Response.Write(req.ResponseTime);
Response.Write(req.StatusCode);
return RedirectToAction("../Import/Feeds");
//return View();
} catch (SolrConnectionException) {
throw new Exception(string.Format("Couldn't Import RSS Feeds"));
}
}

My XML configuration file looks something like this...

dataConfig
dataSource type=HttpDataSource /
  document
entity name=filedatasource
processor=FileListEntityProcessor
baseDir=./solr/conf/dataimporthandler
fileName=^.*xml$
recursive=true
rootEntity=false
dataSource=null

  entity name=cnn
  pk=link
  datasource=filedatasource
  url=http://rss.cnn.com/rss/cnn_topstories.rss;
  processor=XPathEntityProcessor
  forEach=/rss/channel | /rss/channel/item
  transformer=DateFormatTransformer,HTMLStripTransformer

field column=source   xpath=/rss/channel/title
commonField=true /
field column=source-link  xpath=/rss/channel/link
 commonField=true /
field column=subject  xpath=/rss/channel/description
commonField=true /
field column=titlexpath=/rss/channel/item/title /
field column=link xpath=/rss/channel/item/link /
field column=description  xpath=/rss/channel/item/description
stripHTML=true /
field column=creator  xpath=/rss/channel/item/creator /
field column=item-subject xpath=/rss/channel/item/subject /
field column=author   xpath=/rss/channel/item/author /
field column=comments xpath=/rss/channel/item/comments /
field column=pubdate  xpath=/rss/channel/item/pubDate
dateTimeFormat=yyyy-MM-dd'T'hh:mm:ss'Z' /
  /entity

  entity name=newsweek
pk=link
datasource=filedatasource
url=http://feeds.newsweek.com/newsweek/nation;
processor=XPathEntityProcessor
forEach=/rss/channel | /rss/channel/item
transformer=DateFormatTransformer,HTMLStripTransformer

field column=source   xpath=/rss/channel/title
commonField=true /
field column=source-link  xpath=/rss/channel/link
 commonField=true /
field column=subject  xpath=/rss/channel/description
commonField=true /
field column=titlexpath=/rss/channel/item/title /
field column=link xpath=/rss/channel/item/link /
field column=description  xpath=/rss/channel/item/description
stripHTML=true /
field column=creator  xpath=/rss/channel/item/creator /
field column=item-subject xpath=/rss/channel/item/subject /
field column=author   xpath=/rss/channel/item/author /
field column=comments xpath=/rss/channel/item/comments /
field column=pubdate  xpath=/rss/channel/item/pubDate
dateTimeFormat=yyyy-MM-dd'T'hh:mm:ss'Z'/
  /entity
   /entity
  /document
/dataConfig

As you can see, I can add what appears to be as many sub-entities as I
want. The idea was to reload the xml file after each entity is added.
What else am I missing here, because the reload-config command does not seem
to be working? Any ideas would be great!

Thanks,
Adam Estrada
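
For anyone debugging the same thing outside of C#, the equivalent curl calls
against the handler itself (assuming it is registered at /dataimport) would
be roughly:

  curl "http://localhost:8983/solr/dataimport?command=reload-config"
  curl "http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true"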

On Sat, Dec 11, 2010 at 4:48 PM, Ahmet Arslan iori...@yahoo.com wrote:

  I found that you can have a single config file that can
  have several
  entities in it. My question now is how can I add entities
  without restarting
  the Solr service?

 You mean changing and re-loading xml config file?

 dataimport?command=reload-config
 http://wiki.apache.org/solr/DataImportHandler#Commands






[pubDate] is not converting correctly

2010-12-12 Thread Adam Estrada
All,

I am having some difficulties parsing the pubDate field that is part of the
RSS spec (I believe). I get the warning that states, Dec 12, 2010 6:45:26
PM org.apache.solr.handler.dataimport.DateFormatTransformer
 transformRow
WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43
+0000
at java.text.DateFormat.parse(Unknown Source)

Does anyone know how to fix this? I would eventually like to do a date query
but without the ability to properly parse them I don't know if it's going to
work.

Thanks,
Adam


Re: [pubDate] is not converting correctly

2010-12-12 Thread Adam Estrada
Thanks for the feedback! There are quite a few formats that can be used. I
am experiencing at least 5 of them. Would something like this work? Note
that there are 2 different formats separated by a comma.

field column=pubdate xpath=/rss/channel/item/pubDate
dateTimeFormat=EEE, dd MMM yyyy HH:mm:ss zzz, yyyy-MM-dd'T'HH:mm:ss'Z' /

I don't suppose it will, because there is already a comma in the first
parser. I guess I am really looking for an all-purpose date-time parser, but
even if I have that, would I still be able to query *all* fields in the
index?

Good article:
http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm

Adam
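
If the comma-in-the-pattern limitation can't be worked around, another
option is a small custom DIH transformer that tries several formats in turn.
A rough, untested sketch (the class name and the pubdate field name are made
up for illustration):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class MultiFormatDateTransformer extends Transformer {
  // candidate patterns, most common first
  private static final String[] PATTERNS = {
      "EEE, dd MMM yyyy HH:mm:ss Z",
      "yyyy-MM-dd'T'HH:mm:ss'Z'" };

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object raw = row.get("pubdate");
    if (raw != null) {
      for (String pattern : PATTERNS) {
        try {
          // Locale.ENGLISH so weekday/month names parse on any JVM locale
          row.put("pubdate",
              new SimpleDateFormat(pattern, Locale.ENGLISH).parse(raw.toString()));
          break;
        } catch (ParseException e) {
          // fall through and try the next pattern
        }
      }
    }
    return row;
  }
}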

On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/12/13 8:49), Adam Estrada wrote:

 All,

 I am having some difficulties parsing the pubDate field that is part of
  the
 RSS spec (I believe). I get the warning that states, Dec 12, 2010
 6:45:26
 PM org.apache.solr.handler.dataimport.DateFormatTransformer
  transformRow
 WARNING: Could not parse a Date field
 java.text.ParseException: Unparseable date: Thu, 30 Jul 2009 14:41:43
  +0000
 at java.text.DateFormat.parse(Unknown Source)

 Does anyone know how to fix this? I would eventually like to do a date
 query
 but without the ability to properly parse them I don't know if it's going
 to
 work.

 Thanks,
 Adam


 Adam,

 How does your data-config.xml look like for that field?
 Have you looked at rss-data-config.xml file
 under example/example-DIH/solr/rss/conf directory?

 Koji
 --
 http://www.rondhuit.com/en/



Re: Indexing documents with SOLR

2010-12-11 Thread Adam Estrada
Pankaj,

Check this article out on how to get going with Nutch.
http://bit.ly/dbBdK4
This is a few months old, so you will have to note
that there is a new
parameter called something like -SolrUrl that will allow you to update your
solr index with the crawled data.

For crawling your local file system, you will have to change the http:// to
file:// in your seed.txt file to point to the directory you want to crawl.
Another VERY important option is to increase your Java heap size. I do this
by using the JAVA_OPT environment variable.

Adam
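
Concretely, that can look something like this (paths below are examples
only, and I'm assuming your launch script reads JAVA_OPTS):

  # give the crawler a much larger heap before launching it
  export JAVA_OPTS="-Xms512m -Xmx2048m"

  # seed a local-filesystem crawl: file:// instead of http://
  echo "file:///data/documents/" > urls/seed.txt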

On Sat, Dec 11, 2010 at 8:27 AM, pankaj bhatt panbh...@gmail.com wrote:

 Hi Adam,
    Thanks a lot for pointing me to NUTCH.
    Can you please tell me: through NUTCH, can I read a directory on the
 local system or on a shared file system?

   Will wait for your response.

 / Pankaj Bhatt


 On Fri, Dec 10, 2010 at 9:35 PM, Adam Estrada estrada.a...@gmail.com wrote:

 Nutch is also a great option if you want a crawler. I have found that you
  will need to use the latest version of PDFBox and its dependencies for
 better results. Also, make sure to set JAVA_OPT to something really large
 so
 that you won't exceed your heap size.

 Adam

 On Fri, Dec 10, 2010 at 6:27 AM, Tommaso Teofili
  tommaso.teof...@gmail.com wrote:

  Hi Pankaj,
  you can find the needed documentation right here [1].
  Hope this helps,
  Tommaso
 
  [1] : http://wiki.apache.org/solr/ExtractingRequestHandler
 
  2010/12/10 pankaj bhatt panbh...@gmail.com
 
   Hi All,
I am a newbie to SOLR and trying to integrate TIKA + SOLR.
Can anyone please guide me, how to achieve this.
  
   * My Req is:* I have a directory containing a lot of PDF,DOC's and i
 need
   to
   make a search within the documents. I am using SOLR web application.
  
 I just need some sample xml code both for solr-config.xml
 and
  the
   directory-schema.xml
  Awaiting eagerly for your response.
  
   Regards,
   Pankaj Bhatt.
  
 





Re: [Multiple] RSS Feeds at a time...

2010-12-11 Thread Adam Estrada
 at 10:38 PM, Lance Norskog goks...@gmail.com wrote:

 There is I believe no way to do this without separate copies of your
 script. Each 'handler=/dataimport' has to refer to a separate config
 file.

 You can make several copies and name them config1.xml, config2.xml
 etc. You'll have to call each one manually, so you have to manage your
 own thread pool.

 On Fri, Dec 10, 2010 at 8:15 AM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:
  All,
 
  Right now I am using the default DIH config that comes with the Solr
  examples. I update my index using the dataimport handler here
 
  http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport
 
  This works fine but I want to be able to index more than just one feed at
 a
  time and more importantly I want to be able to index both ATOM and RSS
 feeds
  which means that the schema will definitely be different.
 
  There is a good example on how to index all of the example docs in the
  SolrNet example application but that is looking for xml files with the
  properly formatted xml tags.
 
 foreach (var file in
  Directory.GetFiles(Server.MapPath(/exampledocs), *.xml))
 {
 connection.Post(/update, File.ReadAllText(file,
  Encoding.UTF8));
 }
 solr.Commit();
 
  example xml:
 
  - add
   - doc
field name=*id*F8V7067-APL-KIT/field
field name=*name*Belkin Mobile Power Cord for iPod w/ Dock/field
field name=*manu*Belkin/field
field name=*cat*electronics/field
field name=*cat*connector/field
field name=*features*car power adapter, white/field
field name=*weight*4/field
field name=*price*19.95/field
field name=*popularity*1/field
field name=*inStock*false/field
field name=*manufacturedate_dt*2005-08-01T16:30:25Z/field
   /doc
  /add
 
  This obviously won't help me when trying to grab random RSS feeds so my
  question is, how can I ingest several feeds at a time? Can I do this
  programmatically or is there a configuration option I am missing?
 
  Thanks,
  Adam
 



 --
 Lance Norskog
 goks...@gmail.com
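
In solrconfig.xml those separate handlers would look roughly like this
(handler names and config file names are arbitrary):

  <requestHandler name="/dataimport-cnn"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">config1.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/dataimport-newsweek"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">config2.xml</str>
    </lst>
  </requestHandler>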



Re: [Multiple] RSS Feeds at a time...

2010-12-11 Thread Adam Estrada
You are da man! w00t!

adam

On Sat, Dec 11, 2010 at 4:48 PM, Ahmet Arslan iori...@yahoo.com wrote:

  I found that you can have a single config file that can
  have several
  entities in it. My question now is how can I add entities
  without restarting
  the Solr service?

 You mean changing and re-loading xml config file?

 dataimport?command=reload-config
 http://wiki.apache.org/solr/DataImportHandler#Commands






[Parsing] Date Fields

2010-12-11 Thread Adam Estrada
All,

I am ingesting a lot of RSS feeds as part of my application and I keep
getting the same error.

WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Mon, 06 Dec 2010 23:31:38
+0000
at java.text.DateFormat.parse(Unknown Source)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.process(Date
FormatTransformer.java:89)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow
(DateFormatTransformer.java:69)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransf
ormer(EntityProcessorWrapper.java:195)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
ityProcessorWrapper.java:241)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
r.java:357)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
r.java:383)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
ava:242)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
rter.java:331)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
ava:389)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja
va:370)
Dec 11, 2010 6:25:47 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Dec 11, 2010 6:25:47 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDelete
s=false)

Are there any tips or tricks to getting standard RSS update fields to
import correctly?

An example for a DIH config XML file is as follows:

  entity name=CBS
pk=link
datasource=filedatasource
url=http://feeds.cbsnews.com/CBSNewsMain?format=xml;
processor=XPathEntityProcessor
forEach=/rss/channel | /rss/channel/item
transformer=DateFormatTransformer,HTMLStripTransformer
 field column=source   xpath=/rss/channel/title
commonField=true /
field column=source-link  xpath=/rss/channel/link
 commonField=true /
field column=subject  xpath=/rss/channel/description
commonField=true /
field column=titlexpath=/rss/channel/item/title /
field column=link xpath=/rss/channel/item/link /
field column=description  xpath=/rss/channel/item/description
stripHTML=true /
field column=creator  xpath=/rss/channel/item/creator /
field column=item-subject xpath=/rss/channel/item/subject /
field column=author   xpath=/rss/channel/item/author /
field column=comments xpath=/rss/channel/item/comments /
field column=pubdate  xpath=/rss/channel/item/pubDate
dateTimeFormat=yyyy-MM-dd'T'hh:mm:ss'Z' /
  /entity

Any tips on this would be really appreciated as I need to query based on the
date the article was published.

Thanks,
Adam
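
For what it's worth: RSS pubDate values are RFC-822 dates, so an ISO pattern
like the one above can never match them. Something along these lines should
be closer (untested):

  <field column="pubdate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />

SimpleDateFormat's Z matches numeric offsets such as +0000; feeds that use
zone names like GMT would need zzz instead.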


Re: Indexing documents with SOLR

2010-12-10 Thread Adam Estrada
Nutch is also a great option if you want a crawler. I have found that you
will need to use the latest version of PDFBox and its dependencies for
better results. Also, make sure to set JAVA_OPT to something really large so
that you won't exceed your heap size.

Adam

On Fri, Dec 10, 2010 at 6:27 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:

 Hi Pankaj,
 you can find the needed documentation right here [1].
 Hope this helps,
 Tommaso

 [1] : http://wiki.apache.org/solr/ExtractingRequestHandler

 2010/12/10 pankaj bhatt panbh...@gmail.com

  Hi All,
   I am a newbie to SOLR and trying to integrate TIKA + SOLR.
   Can anyone please guide me, how to achieve this.
 
  * My Req is:* I have a directory containing a lot of PDF,DOC's and i need
  to
  make a search within the documents. I am using SOLR web application.
 
I just need some sample xml code both for solr-config.xml and
 the
  directory-schema.xml
 Awaiting eagerly for your response.
 
  Regards,
  Pankaj Bhatt.
 



[Multiple] RSS Feeds at a time...

2010-12-10 Thread Adam Estrada
All,

Right now I am using the default DIH config that comes with the Solr
examples. I update my index using the dataimport handler here

http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport

This works fine but I want to be able to index more than just one feed at a
time and more importantly I want to be able to index both ATOM and RSS feeds
which means that the schema will definitely be different.

There is a good example on how to index all of the example docs in the
SolrNet example application but that is looking for xml files with the
properly formatted xml tags.

foreach (var file in
Directory.GetFiles(Server.MapPath(/exampledocs), *.xml))
{
connection.Post(/update, File.ReadAllText(file,
Encoding.UTF8));
}
solr.Commit();

example xml:

- add
 - doc
   field name=*id*F8V7067-APL-KIT/field
   field name=*name*Belkin Mobile Power Cord for iPod w/ Dock/field
   field name=*manu*Belkin/field
   field name=*cat*electronics/field
   field name=*cat*connector/field
   field name=*features*car power adapter, white/field
   field name=*weight*4/field
   field name=*price*19.95/field
   field name=*popularity*1/field
   field name=*inStock*false/field
   field name=*manufacturedate_dt*2005-08-01T16:30:25Z/field
  /doc
/add

This obviously won't help me when trying to grab random RSS feeds so my
question is, how can I ingest several feeds at a time? Can I do this
programmatically or is there a configuration option I am missing?

Thanks,
Adam


[Multiple] RSS Feeds and Source Field

2010-12-09 Thread Adam Estrada
All,

I am indexing RSS feeds from several sources so I have a couple questions.

1. There is only 1 source for each RSS feed, which is typically the name of
the feed, yet I get an error in my app stating
*Value cannot be null.
Parameter name: source*
I look at the index in Luke and there are data values in there. Any ideas on
why my app would be throwing that?

2. I would like to ingest several feeds at a time. What is the proper way to
define them in the XML config file? Can I have two document tags in
there or am I limited to just one?

Adam


Re: [Multiple] RSS Feeds and Source Field

2010-12-09 Thread Adam Estrada
In Luke I looked at the available fields and term counts per field and there
is a source field without an asterisk beside it. The source value is
CNN.com which is what I would expect it to be. I still get a null value in
my app which is probably a bug somewhere in my application.

Any more of your suggestions on the index would be greatly appreciated

Adam

On Thu, Dec 9, 2010 at 3:46 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 You look at what index in Luke?  I bet you $10 there is no field called
 source* in your index. With an asterisk in it.


 On 12/9/2010 3:23 PM, Adam Estrada wrote:

 All,

 I am indexing RSS feeds from several sources so I have a couple questions.

 1. There is only 1 source for each RSS feed which is typically the name of
 the feed, I get an error in my app stating 
 *Value cannot be null.
 Parameter name: source*
 I look at the index in Luke and there are data values in there. Any ideas
 on
 why my app would be throwing that?

 2. I would like to ingest several feeds at a time. What is the proper way
 to
  define them in the XML config file? Can I have two document tags in
 there or am I limited to just one?

 Adam




Re: Open source Solr UI with multiple select faceting?

2010-12-09 Thread Adam Estrada
SolrNet has a great example application that you can use...There is a great
Javascript project called SolrAjax but I don't know what the state of it is.

Adam

On Thu, Dec 9, 2010 at 4:53 PM, Andy angelf...@yahoo.com wrote:

 Hi,

 Any open source Solr UI's that support selecting multiple facet values
 (OR faceting)? For example allowing a user to select red or blue for
 the facet field Color.

 I'd prefer libraries in javascript or Python. I know about ajax-solr but it
 doesn't seem to support multiple selects.

 Thanks.






Re: [Multiple] RSS Feeds and Source Field

2010-12-09 Thread Adam Estrada
I ended up copying the source field to another field, which seems to have fixed the
problem...I still have so much to learn about when it comes to using Solr...

Thanks for all the great feedback,
Adam

On Thu, Dec 9, 2010 at 11:03 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, you say you get an error in your app. I'm a bit confused. Is
 this before you try to send it to Solr or as a result of sending it to
 Solr?
 If the latter, I'd wager source is required in your schema and  you're not
 sending it in your document. Try instrumenting your app to check
 that every outgoing document has a value

 If that's irrelevant, can we see your schema file?

 You can send as many documents in a packet as you want.

 Best
 Erick

 On Thu, Dec 9, 2010 at 3:23 PM, Adam Estrada
  estrada.adam.gro...@gmail.com wrote:

  All,
 
  I am indexing RSS feeds from several sources so I have a couple
 questions.
 
  1. There is only 1 source for each RSS feed which is typically the name
 of
   the feed, yet I get an error in my app stating
  *Value cannot be null.
  Parameter name: source*
  I look at the index in Luke and there are data values in there. Any ideas
  on
  why my app would be throwing that?
 
  2. I would like to ingest several feeds at a time. What is the proper way
  to
  define them in a the XML config file? Can I have two document tags in
  there or am I limited to just one?
 
  Adam
 



[Casting] values on update/csv

2010-12-08 Thread Adam Estrada
All,

I have a csv file and I want to store one of the fields as a tdouble type.
It does not like that at all...Is there a way to cast the string value to a
tdouble?

Thanks,
Adam


Re: [Casting] values on update/csv

2010-12-08 Thread Adam Estrada
Hi,

I am using curl to run the following and as soon as I convert the field type
from string to tdouble, I get the errors you see below.

0:0:0:0:0:0:0:1 -  -  [08/12/2010:23:28:27 +0000] GET
/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\allCountries\xaa.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8
HTTP/1.1 500 4023

I am trying to index coordinates in decimal degrees so many of them have
negative values. Could this be the problem?
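
Actually, looking at the trace below, the parser is choking on the literal
string "lat", which makes me think the header row of the csv is being loaded
as a document (negative values should be fine for tdouble). Since fieldnames
is already given, skipping the first line should help; a hedged guess at the
fix:

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&skipLines=1&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\allCountries\xaa.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

Alternatively, drop fieldnames and pass header=true so the first line is
read as the field list instead of a document.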


Dec 8, 2010 6:28:27 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: lat
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at org.apache.solr.schema.TrieField.createField(TrieField.java:431)
at
org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
a:246)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
ateProcessorFactory.java:60)
at
org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:386)
at
org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHand
ler.java:400)
at
org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:363)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
ntentStreamHandlerBase.java:54)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
erBase.java:131)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handle
Request(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
r.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(Servlet
Handler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:3
65)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.jav
a:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:1
81)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:7
12)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)

at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHand
lerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.
java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:1
39)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:50
2)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpCo
nnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.
java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool
.java:442)

Dec 8, 2010 6:28:27 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/csv
params={fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&commit=true&overwrite=true&stream.contentType=text/plain;charset%3Dutf-8&separator=,&stream.file=C:\tmp\allCountries\xaa.csv} status=500 QTime=52
Dec 8, 2010 6:28:27 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: lat
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Double.parseDouble(Unknown Source)
at org.apache.solr.schema.TrieField.createField(TrieField.java:431)
at
org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
a:246)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
ateProcessorFactory.java:60)
at
org.apache.solr.handler.CSVLoader.doAdd(CSVRequestHandler.java:386)
at
org.apache.solr.handler.SingleThreadedCSVLoader.addDoc(CSVRequestHand
ler.java:400)
at
org.apache.solr.handler.CSVLoader.load(CSVRequestHandler.java:363)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
ntentStreamHandlerBase.java:54)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
erBase.java:131)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handle

Re: Batch Update Fields

2010-12-05 Thread Adam Estrada
OK so the way I understand this is that if there is a synonym on a specific
field at index time, that value will be stored rather than the one in the
csv that I am indexing? I will give it a whirl and report back...

Thanks!
Adam

On Sat, Dec 4, 2010 at 2:27 PM, Erick Erickson erickerick...@gmail.com wrote:

 When you define your fieldType at index time. My idea
 was that you substitue these on the way in to your
 index. You may need a specific field type just for your
 country conversion Perhaps in a copyField if
 you need both the code and full name

 Best
 Erick

 On Sat, Dec 4, 2010 at 12:16 PM, Adam Estrada 
 estrada.adam.gro...@gmail.com
  wrote:

  Synonyms eh? I have a synonym list like the following so how do I
 identify
  the synonyms on a specific field. The only place the field is used is as
 a
  facet.
 
  original field = country name
 
  AF = AFGHANISTAN
  AX = ÅLAND ISLANDS
  AL = ALBANIA
  DZ = ALGERIA
  AS = AMERICAN SAMOA
  AD = ANDORRA
  AO = ANGOLA
  AI = ANGUILLA
  AQ = ANTARCTICA
  AG = ANTIGUA AND BARBUDA
  AR = ARGENTINA
  AM = ARMENIA
  AW = ARUBA
  AU = AUSTRALIA
  AT = AUSTRIA
  etc...
 
  Any advise on that would be great and very much appreciated!
 
  Adam
 
  On Fri, Dec 3, 2010 at 3:55 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   That will certainly work. Another option, assuming the country codes
 are
   in their own field would be to put the transformations into a synonym
  file
   that was only used on that field. That way you'd get this without
 having
   to do the pre-process step of the raw data...
  
   That said, if you pre-processing is working for you it may  not be
 worth
   your while
   to worry about doing it differently
  
   Best
   Erick
  
   On Fri, Dec 3, 2010 at 12:51 PM, Adam Estrada 
   estrada.adam.gro...@gmail.com
wrote:
  
First off...I know enough about Solr to be VERY dangerous so please
   bear
with me ;-) I am indexing the geonames database which only provides
   country
codes. I can facet the codes but to the end user who may not know all
  249
codes, it isn't really all that helpful. Therefore, I want to map the
   full
country names to the country codes provided in the geonames db.
http://download.geonames.org/export/dump/
   
     I used a simple split
   function
to
chop the 850 meg txt file in to manageable csv's that I can import in
  to
Solr. Now that all 7 million + documents are in there, I want to
 change
   the
     country codes to the actual country names. I would have liked to have
  done
   it
in the index but finding and replacing the strings in the csv seems
 to
  be
working fine. After that I can just reindex the entire thing.
   
Adam
   
On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson 
  erickerick...@gmail.com
wrote:
   
 Have you consider defining synonyms for your code -country
 conversion at index time (or query time for that matter)?

 We may have an XY problem here. Could you state the high-level
 problem you're trying to solve? Maybe there's a better solution...

 Best
 Erick

 On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada 
 estrada.adam.gro...@gmail.com
  wrote:

  I wonder...I know that sed would work to find and replace the
 terms
   in
 all
  of the csv files that I am indexing but would it work to find and
replace
  key terms in the index?
 
  find C:\\tmp\\index\\data -type f -exec sed -i
 's/AF/AFGHANISTAN/g'
   {}
\;
 
  That command would iterate through all the files in the data
   directory
 and
   replace the country code with the full country name. I may just
  back
up
  the
  directory and try it. I have it running on csv files right now
 and
   it's
  working wonderfully. For those of you interested, I am indexing
 the
 entire
  Geonames dataset
 http://download.geonames.org/export/dump/(allCountries.zip)
  which gives me a pretty comprehensive world gazetteer. My next
 step
   is
  gonna
  be to display the results as KML to view over a google globe.
 
  Thoughts?
 
  Adam
 
  On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson 
erickerick...@gmail.com
  wrote:
 
   No, there's no equivalent to SQL update for all values in a
  column.
  You'll
   have to reindex all the documents.
  
   On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada 
   estrada.adam.gro...@gmail.com
wrote:
  
OK part 2 of my previous question...
   
Is there a way to batch update field values based on a
 certain
  criteria?
For example, if thousands of documents have a field value of
  'US'
can
 I
update all of them to 'United States' programmatically?
   
Adam
  
 

   
  
 



Re: Batch Update Fields

2010-12-04 Thread Adam Estrada
Synonyms eh? I have a synonym list like the following so how do I identify
the synonyms on a specific field. The only place the field is used is as a
facet.

original field = country name

AF = AFGHANISTAN
AX = ÅLAND ISLANDS
AL = ALBANIA
DZ = ALGERIA
AS = AMERICAN SAMOA
AD = ANDORRA
AO = ANGOLA
AI = ANGUILLA
AQ = ANTARCTICA
AG = ANTIGUA AND BARBUDA
AR = ARGENTINA
AM = ARMENIA
AW = ARUBA
AU = AUSTRALIA
AT = AUSTRIA
etc...

Any advise on that would be great and very much appreciated!

Adam
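
In case it helps anyone following along, a sketch of what that could look
like in schema.xml, with the code-to-name pairs saved as lines like
"AF => AFGHANISTAN" in a country_synonyms.txt used only by this field
(type and file names here are arbitrary, untested):

  <fieldType name="country" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="country_synonyms.txt"
              ignoreCase="true" expand="false"
              tokenizerFactory="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

Using the KeywordTokenizer both for the analyzer and for parsing the synonym
file keeps multi-word names like AMERICAN SAMOA as a single token, so the
facet values stay whole.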

On Fri, Dec 3, 2010 at 3:55 PM, Erick Erickson erickerick...@gmail.com wrote:

 That will certainly work. Another option, assuming the country codes are
 in their own field would be to put the transformations into a synonym file
 that was only used on that field. That way you'd get this without having
 to do the pre-process step of the raw data...

 That said, if you pre-processing is working for you it may  not be worth
 your while
 to worry about doing it differently

 Best
 Erick

 On Fri, Dec 3, 2010 at 12:51 PM, Adam Estrada 
 estrada.adam.gro...@gmail.com
  wrote:

  First off...I know enough about Solr to be VERY dangerous so please bear
  with me ;-) I am indexing the geonames database which only provides
 country
  codes. I can facet the codes but to the end user who may not know all 249
  codes, it isn't really all that helpful. Therefore, I want to map the
 full
  country names to the country codes provided in the geonames db.
  http://download.geonames.org/export/dump/
 
   I used a simple split
 function
  to
  chop the 850 meg txt file in to manageable csv's that I can import in to
  Solr. Now that all 7 million + documents are in there, I want to change
 the
   country codes to the actual country names. I would have liked to have done
 it
  in the index but finding and replacing the strings in the csv seems to be
  working fine. After that I can just reindex the entire thing.
 
  Adam
 
  On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Have you consider defining synonyms for your code -country
   conversion at index time (or query time for that matter)?
  
   We may have an XY problem here. Could you state the high-level
   problem you're trying to solve? Maybe there's a better solution...
  
   Best
   Erick
  
   On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada 
   estrada.adam.gro...@gmail.com
wrote:
  
I wonder...I know that sed would work to find and replace the terms
 in
   all
of the csv files that I am indexing but would it work to find and
  replace
key terms in the index?
   
find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g'
 {}
  \;
   
That command would iterate through all the files in the data
 directory
   and
 replace the country code with the full country name. I may just back
  up
the
directory and try it. I have it running on csv files right now and
 it's
working wonderfully. For those of you interested, I am indexing the
   entire
Geonames dataset
   http://download.geonames.org/export/dump/(allCountries.zip)
which gives me a pretty comprehensive world gazetteer. My next step
 is
gonna
be to display the results as KML to view over a google globe.
   
Thoughts?
   
Adam
   
On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson 
  erickerick...@gmail.com
wrote:
   
 No, there's no equivalent to SQL update for all values in a column.
You'll
 have to reindex all the documents.

 On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada 
 estrada.adam.gro...@gmail.com
  wrote:

  OK part 2 of my previous question...
 
  Is there a way to batch update field values based on a certain
criteria?
  For example, if thousands of documents have a field value of 'US'
  can
   I
  update all of them to 'United States' programmatically?
 
  Adam

   
  
 



Re: Batch Update Fields

2010-12-03 Thread Adam Estrada
I wonder...I know that sed would work to find and replace the terms in all
of the csv files that I am indexing but would it work to find and replace
key terms in the index?

find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \;

That command would iterate through all the files in the data directory and
replace the country code with the full country name. I may just back up the
directory and try it. I have it running on csv files right now and it's
working wonderfully. For those of you interested, I am indexing the entire
Geonames dataset http://download.geonames.org/export/dump/ (allCountries.zip)
which gives me a pretty comprehensive world gazetteer. My next step is gonna
be to display the results as KML to view over a Google globe.

Thoughts?

Adam
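
One caution on the sed idea: running it over the binary files under the
index data directory will corrupt the index, and a bare s/AF/AFGHANISTAN/g
will also rewrite AF wherever those two letters occur inside other fields.
Anchoring the pattern to the country-code column of the csv is safer, e.g.
if the code is the last comma-separated field on each line:

  # rewrite only a whole trailing csv column, not every occurrence of AF
  sed -i 's/,AF$/,AFGHANISTAN/' *.csv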

On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com wrote:

 No, there's no equivalent to SQL update for all values in a column. You'll
 have to reindex all the documents.

 On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada 
 estrada.adam.gro...@gmail.com
  wrote:

  OK part 2 of my previous question...
 
  Is there a way to batch update field values based on a certain criteria?
  For example, if thousands of documents have a field value of 'US' can I
  update all of them to 'United States' programmatically?
 
  Adam



Re: Batch Update Fields

2010-12-03 Thread Adam Estrada
First off...I know enough about Solr to be VERY dangerous, so please bear
with me ;-) I am indexing the geonames database, which only provides country
codes. I can facet the codes, but to the end user who may not know all 249
codes, it isn't really all that helpful. Therefore, I want to map the full
country names to the country codes provided in the geonames db.
http://download.geonames.org/export/dump/

I used a simple split function to chop the 850 meg txt file into manageable
csv's that I can import into Solr. Now that all 7 million + documents are in
there, I want to change the country codes to the actual country names. I
would have liked to have done it in the index, but finding and replacing the
strings in the csv seems to be working fine. After that I can just reindex
the entire thing.

Adam

On Fri, Dec 3, 2010 at 12:42 PM, Erick Erickson erickerick...@gmail.com wrote:

 Have you considered defining synonyms for your code-to-country
 conversion at index time (or query time, for that matter)?

 We may have an XY problem here. Could you state the high-level
 problem you're trying to solve? Maybe there's a better solution...

 Best
 Erick

 On Fri, Dec 3, 2010 at 12:20 PM, Adam Estrada
 estrada.adam.gro...@gmail.com wrote:

  I wonder...I know that sed would work to find and replace the terms in all
  of the csv files that I am indexing, but would it work to find and replace
  key terms in the index?

  find C:\\tmp\\index\\data -type f -exec sed -i 's/AF/AFGHANISTAN/g' {} \;

  That command would iterate through all the files in the data directory and
  replace the country code with the full country name. I may just back up
  the directory and try it. I have it running on csv files right now and
  it's working wonderfully. For those of you interested, I am indexing the
  entire Geonames dataset http://download.geonames.org/export/dump/
  (allCountries.zip), which gives me a pretty comprehensive world gazetteer.
  My next step is going to be to display the results as KML to view over a
  Google globe.

  Thoughts?

  Adam

  On Fri, Dec 3, 2010 at 7:57 AM, Erick Erickson erickerick...@gmail.com
  wrote:

   No, there's no equivalent to SQL update for all values in a column.
   You'll have to reindex all the documents.

   On Thu, Dec 2, 2010 at 10:52 PM, Adam Estrada
   estrada.adam.gro...@gmail.com wrote:

    OK part 2 of my previous question...

    Is there a way to batch update field values based on certain criteria?
    For example, if thousands of documents have a field value of 'US' can I
    update all of them to 'United States' programmatically?

    Adam


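For reference, a minimal sketch of the synonym route Erick suggests above,
assuming Solr 1.4/3.x-era config; the field type name, the synonyms file
name, and the two sample mappings are made up, and the remaining codes would
follow the same pattern:

# conf/country_synonyms.txt -- one mapping per line
AF => Afghanistan
US => United States

<!-- schema.xml: a keyword-style field type for the faceted country field -->
<fieldType name="countryName" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="country_synonyms.txt"
            ignoreCase="true" expand="false"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

The KeywordTokenizerFactory in both places keeps multi-word names such as
"United States" as a single token, and therefore a single facet value; with a
whitespace-based chain they would split into separate facet entries.
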

Joining Fields in an Index

2010-12-02 Thread Adam Estrada
All,

I have an index that has a field with country codes in it. There are 7 million
or so documents in the index, and when displaying facets the country codes
don't mean a whole lot to me. Is there any way to add a field with the full
country names and then join the codes to it accordingly? I suppose I can do
this before updating the records in the index, but first I would like to know
whether there is a way to do this sort of join.

Example: US -> United States

Thanks,
Adam

Re: Joining Fields in an Index

2010-12-02 Thread Adam Estrada
Hi,

I was hoping to do it directly in the index but it was more out of curiosity 
than anything. I can certainly map it in the DAO but again...I was hoping to 
learn if it was possible in the index.

Thanks for the feedback!

Adam

On Dec 2, 2010, at 5:48 PM, Savvas-Andreas Moysidis wrote:

 Hi,
 
 If you are able to do a full re-index then you could index the full names
 and not the codes. When you later facet on the Country field you'll get the
 actual name rather than the code.
 If you are not able to re-index then probably this conversion could be added
 at your application layer prior to displaying your results (e.g. in your DAO
 object).
 
 On 2 December 2010 22:05, Adam Estrada estrada.adam.gro...@gmail.com wrote:
 
 All,
 
 I have an index that has a field with country codes in it. There are 7
 million or so documents in the index, and when displaying facets the country
 codes don't mean a whole lot to me. Is there any way to add a field with the
 full country names and then join the codes to it accordingly? I suppose I
 can do this before updating the records in the index, but first I would like
 to know whether there is a way to do this sort of join.

 Example: US -> United States
 
 Thanks,
 Adam


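A minimal sketch of the application-layer mapping Savvas describes above,
assuming the country facet has already been parsed out of the Solr response
into (code, count) pairs; COUNTRY_NAMES is a stand-in dict that would be
filled from a reference list such as Geonames' countryInfo.txt:

# Sketch: relabel country-code facet values for display; unknown codes pass
# through unchanged.
COUNTRY_NAMES = {'AF': 'Afghanistan', 'US': 'United States'}  # ...and the rest

def label_facets(facet_counts):
    """facet_counts: list of (code, count) pairs from the country facet."""
    return [(COUNTRY_NAMES.get(code, code), count) for code, count in facet_counts]

print(label_facets([('US', 1200), ('AF', 37)]))
# -> [('United States', 1200), ('Afghanistan', 37)]
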

Using Multiple Cores for Multiple Users

2010-11-09 Thread Adam Estrada
All,

I have a web application that requires the user to register and then log in
to gain access to the site. Pretty standard stuff...Now I would like to know
what the best approach would be to implement a customized search
experience for each user. Would this mean creating a separate core per user?
I think that this is not possible without restarting Solr after each core is
added to the multi-core xml file, right?

My use case is this...User A would like to index 5 RSS feeds and User B
would like to index 5 completely different RSS feeds and he is not
interested at all in what User A is interested in. This means that they
would have to be separate index cores, right?

What is the best approach for this kind of thing?

Thanks in advance,
Adam


Re: Using Multiple Cores for Multiple Users

2010-11-09 Thread Adam Estrada
Thanks a lot for all the tips, guys! I think that we may explore both
options just to see what happens. I'm sure that scalability will be a huge
mess with the core-per-user scenario. I like the idea of creating a user ID
field and agree that it's probably the best approach. We'll see...I will be
sure to let the list know what I find! Please don't stop posting your
comments everyone ;-) My inquiring mind wants to know...

Adam

On Tue, Nov 9, 2010 at 7:34 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 If storing in a single index (possibly sharded if you need it), you can
 simply include a solr field that specifies the user ID of the saved thing.
 On the client side, in your application, simply ensure that there is an fq
 parameter limiting to the current user, if you want to limit to the current
 user's stuff. Relevancy ranking should work just as if you had 'separate
 cores'; there is no relevancy issue.

 It IS true that when your index gets very large, commits will start taking
 longer, which can be a problem. I don't mean commits will take longer just
 because there is more stuff to commit -- the larger the index, the longer an
 update to a single document will take to commit.

 In general, I suspect that having dozens or hundreds (or thousands!) of
 cores is not going to scale well; it is not going to make good use of your
 cpu/ram/hd resources. Not really the intended use case of multiple cores.

 However, you are probably going to run into some issues with the single
 index approach too. In general, how to deal with multi-tenancy in Solr is
 an oft-asked question for which there doesn't seem to be any "just works and
 does everything for you without needing to think about it" solution in Solr,
 judging from past threads. I am not a Solr developer or expert.

 
 From: Markus Jelsma [markus.jel...@openindex.io]
 Sent: Tuesday, November 09, 2010 6:57 PM
 To: solr-user@lucene.apache.org
 Cc: Adam Estrada
 Subject: Re: Using Multiple Cores for Multiple Users

 Hi,

  All,
 
  I have a web application that requires the user to register and then log in
  to gain access to the site. Pretty standard stuff...Now I would like to
  know what the best approach would be to implement a customized search
  experience for each user. Would this mean creating a separate core per
  user? I think that this is not possible without restarting Solr after each
  core is added to the multi-core xml file, right?

 No, you can dynamically manage cores and parts of their configuration.
 Sometimes you must reindex after a change, the same is true for reloading
 cores. Check the wiki on this one [1].

 
  My use case is this...User A would like to index 5 RSS feeds and User B
  would like to index 5 completely different RSS feeds and he is not
  interested at all in what User A is interested in. This means that they
  would have to be separate index cores, right?

 If you view documents within an rss feed as separate documents, you can
 assign a user ID to those documents, creating a multi-user index with rss
 documents per user, or group, or whatever.

 Having a core per user isn't a good idea if you have many users. It takes up
 additional memory and disk space, doesn't share caches, etc. There is also
 more maintenance, and you need some support scripts to dynamically create
 new cores - Solr currently doesn't create a new core directory structure.

 But reindexing a very large index takes a lot more time and resources, and
 relevancy might be an issue depending on the rss feeds' contents.

 
  What is the best approach for this kind of thing?

 I'd usually store the feeds in a single index, and shard if it's too much
 for a single server with your specifications, unless the demands are too
 specific.

 
  Thanks in advance,
  Adam

 [1]: http://wiki.apache.org/solr/CoreAdmin

 Cheers


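A minimal sketch of the single-index, filter-per-user pattern described
above, assuming each document was indexed with a user_id field (a made-up
name) and a default Solr URL; fq filters are cached separately from the main
query, which is what keeps this cheap per request:

# Sketch: search a shared index, restricted to the current user's documents
# via an fq filter on a hypothetical user_id field.
import urllib.parse
import urllib.request

def search_for_user(user_id, query):
    params = urllib.parse.urlencode({
        'q': query,
        'fq': 'user_id:%s' % user_id,  # limit results to this user's documents
        'wt': 'json',
    })
    url = 'http://localhost:8983/solr/select?' + params  # assumed Solr URL
    return urllib.request.urlopen(url).read()

print(search_for_user('42', 'title:feeds'))

If the core-per-user route is explored anyway, the CoreAdmin handler in [1]
can create cores at runtime (action=CREATE) without restarting Solr, though
the new core's instanceDir and config still have to exist on disk first.
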
