Re: less search results in prod
Enable debugQuery and compare the queries evaluated in the development and production environments.

Regards,
Jayendra

On Sun, Dec 4, 2011 at 5:18 AM, alx...@aim.com wrote:
Hello, I built the solr-3.4.0 data folder on a dev server and copied it to the prod server. I searched for a keyword, then modified the qf and pf params in solrconfig.xml, searched for the same keywords, and then restored qf and pf to their original values. Now Solr returns far fewer docs for the same keywords than the dev server does. I tried other keywords; the issue is the same. I copied solrconfig.xml over from the dev server, but nothing changed. Looking at the statistics, the numDocs and maxDoc values are the same on both servers. Any ideas how to debug this issue? Thanks in advance. Alex.
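For example, running the same request against both environments with debugQuery enabled shows how the query was parsed and how each score was computed (host names and the keyword are placeholders):

    http://dev-host:8983/solr/select?q=keyword&debugQuery=true
    http://prod-host:8983/solr/select?q=keyword&debugQuery=true

Diffing the parsedquery and explain sections of the two responses usually pinpoints the configuration mismatch.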
Re: How to change the port of post.jar
You can pass the full URL to post.jar as an argument, for example:

    java -Durl=http://localhost:8080/solr/update -jar post.jar

Regards,
Jayendra

On Wed, Nov 9, 2011 at 2:37 AM, 刘浪 liu.l...@eisoo.com wrote:
Hi, I want to use post.jar to delete from the index, but my port is 8080 and the default is 8983. How can I change the port from 8983 to 8080? Thank you, Amos
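Since the goal here is deletion, the delete command itself can be posted the same way; the SimplePostTool behind post.jar accepts -Ddata=args to treat its command-line arguments as the request body (the document id below is a placeholder):

    java -Ddata=args -Durl=http://localhost:8080/solr/update -jar post.jar "<delete><id>42</id></delete>"
    java -Ddata=args -Durl=http://localhost:8080/solr/update -jar post.jar "<commit/>"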
Re: question about Field Collapsing/ grouping
Hi Ahson,

http://wiki.apache.org/solr/FieldCollapsing

group.ngroups has been added as a parameter, so you should not need to apply any patches. Solr 3.3 shipped with the grouping feature, so I presume it is already included.

Regards,
Jayendra

On Wed, Sep 14, 2011 at 4:22 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi Jayendra, thanks a lot for your response. Now I have two questions: first, to get the count of groups, is it necessary to apply the specified patch? If so, can you help me a little with the steps to apply that patch, as I am new to Solr/Java? Regards, Ahsan

- Original Message -
From: Jayendra Patil jayendra.patil@gmail.com
To: solr-user@lucene.apache.org; Ahson Iqbal mianah...@yahoo.com
Sent: Tuesday, September 13, 2011 10:55 AM
Subject: Re: question about Field Collapsing/ grouping

At the time we implemented the feature, there was no straightforward solution. What we did was facet on the grouped-by field and count the facets, which gives you the distinct count for the groups. You may also want to check the patch @ https://issues.apache.org/jira/browse/SOLR-2242, which returns the facet counts that you would then need to count yourself.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 1:27 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi, is it possible to get the number of groups that matched a given query? Say there are three fields in the index - DocumentID, Content, Industry - and I query as +(Content:is Content:the)&group=true&group.field=industry. Is it possible to get how many industries matched the query? Please help. Regards, Ahsan
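For reference, the grouped request with distinct-group counts would look like this (field names taken from the question):

    http://localhost:8983/solr/select?q=%2B(Content:is Content:the)&group=true&group.field=Industry&group.ngroups=true

With group.ngroups=true, each grouped field in the response reports both matches (total documents) and ngroups (the number of distinct groups).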
Re: question about Field Collapsing/ grouping
Yup, it seems the group count feature is included now, as mentioned by Klein.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 8:27 AM, O. Klein kl...@octoweb.nl wrote:
Isn't that what the parameter group.ngroups=true is for?
Re: question about Field Collapsing/ grouping
At the time we implemented the feature, there was no straightforward solution. What we did was facet on the grouped-by field and count the facets, which gives you the distinct count for the groups.

You may also want to check the patch @ https://issues.apache.org/jira/browse/SOLR-2242, which returns the facet counts that you would then need to count yourself.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 1:27 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi, is it possible to get the number of groups that matched a given query? Say there are three fields in the index - DocumentID, Content, Industry - and I query as +(Content:is Content:the)&group=true&group.field=industry. Is it possible to get how many industries matched the query? Please help. Regards, Ahsan
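A sketch of that facet-based workaround: request zero rows and facet on the grouped-by field; every bucket with a non-zero count is one group, so the number of buckets returned is the distinct group count:

    http://localhost:8983/solr/select?q=%2B(Content:is Content:the)&rows=0&facet=true&facet.field=Industry&facet.mincount=1&facet.limit=-1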
Re: Accessing a doc field while working at entity level
You should be able to do it using ${feed-source.last-update}. You can find examples and an explanation @ http://wiki.apache.org/solr/DataImportHandler

Regards,
Jayendra

On Mon, Sep 5, 2011 at 8:02 AM, penela pen...@gmail.com wrote:
Hi! This might be a stupid question, but I can't find clear info on how to do it (sorry if it is too obvious). I have the following document configuration (only key elements shown) with two entities, one embedded in the other:

    <dataConfig>
      <dataSource type="URLDataSource" name="rss-ds"/>
      <dataSource type="JdbcDataSource" name="db-ds" driver="com..."/>
      <document>
        <entity name="feed-source" dataSource="db-ds" ... rootEntity="false">
          <field column="last-update" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
          <entity name="feed-content" dataSource="rss-ds" pk="link" ...
                  transformer="DateFormatTransformer, DummyTransformer">
            <field column="timestamp" xpath="/rss/channel/item/pubDate"
                   dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" locale="en"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

What I want to do is access the outer entity field last-update while I'm in the inner entity's transformer, DummyTransformer. Debugging with Eclipse, it looks like that data is correctly stored at runtime in the Context variable passed as a parameter to the transformer, in context.doc.fields. So the question is: is there any way to access a higher-level entity's fields while in an embedded entity? Or document fields at least? Thanks! -Víctor
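A minimal sketch of the suggestion above: the inner entity can pull the outer entity's column into its own row with a TemplateTransformer, after which it is visible to any later transformer in the chain (the column name lastUpdate is illustrative):

    <entity name="feed-content" dataSource="rss-ds" pk="link"
            transformer="TemplateTransformer, DateFormatTransformer, DummyTransformer">
      <!-- copies the outer entity's last-update value into this entity's row -->
      <field column="lastUpdate" template="${feed-source.last-update}"/>
      ...
    </entity>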
Re: Search the contents of given URL in Solr.
For indexing webpages, you can use Nutch with Solr, which does the scraping and indexing of the pages. For finding similar documents/pages you can use http://wiki.apache.org/solr/MoreLikeThis: query for the indexed document (by id or search terms) and it will return similar documents from the index.

Regards,
Jayendra

On Tue, Aug 30, 2011 at 8:23 AM, Sheetal rituzprad...@gmail.com wrote:
Hi, is it possible to give the URL of a site and have the Solr search server read the contents of the given site and recommend similar projects? I scraped the web contents from the given URL and now have the contents as plain text. But when I pass that scraped text to Solr as a query, it doesn't work because the query is too large (depending on the size of the URL's contents). I read somewhere that it is possible to take a URL address and output the relevant projects for it, but I don't remember whether it used Solr or another search engine. Does anyone have any ideas or suggestions for this? I would highly appreciate your comments. Thank you in advance. - Sheetal
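For reference, once the page is indexed, a MoreLikeThis request avoids sending the page text as a query at all - the similarity is computed from the stored document (id and field names are placeholders):

    http://localhost:8983/solr/select?q=id:page-42&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1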
Re: How to get all the terms in a document as Luke does?
You might want to check http://wiki.apache.org/solr/TermVectorComponent - it should provide you with the term vectors plus a lot of additional info.

Regards,
Jayendra

On Tue, Aug 30, 2011 at 3:34 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
Hello, this time I'm trying to duplicate Luke's functionality of knowing which terms occur in a search result/document (without parsing it again). Is there any SolrJ API to do that? P.S. I've also posted the question on SO: http://stackoverflow.com/q/7219111/300248

On Wed, Jul 6, 2011 at 11:09 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
From your patch I see TermFreqVector, which provides the information I want. I also found FieldInvertState.getLength(), which seems to be exactly what I want: I'm after the word count (the sum of tf for every term in the doc). I'm just not sure whether FieldInvertState.getLength() returns just the number of terms (not multiplied by the frequency of each term, i.e. the word count) or not. It seems to return the word count, but I haven't tested it sufficiently.

On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger the.apache.t...@gmail.com wrote:
Gabriele, I created a patch that does this about a year ago. See https://issues.apache.org/jira/browse/SOLR-1837. It was written for Solr 1.4 and is based upon the Document Reconstructor in Luke. The patch adds a link on the main Solr admin page to a docinspector page which will reconstruct the document given a unique id (required). Keep in mind that for non-stored fields you're only looking at what's in the index, not the original text. If you have any issues using this on the most recent release, let me know and I'd be happy to create a new patch for Solr 3.3. One of these days I'll remove the JSP dependency and this may eventually make it into trunk. Thanks,
-Trey Grainger
Search Technology Development Team Lead, Careerbuilder.com
Site Architect, Celiaccess.com

On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
Hello, with an inverted index the term is the key and the documents are the values. Is it still possible, given a document id, to get the terms indexed for that document?

Regards,
K. Gabriele
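For reference, the term vector component reads per-document term information at query time; it requires the field to be indexed with term vectors, and the stock example config exposes the component on a dedicated handler named tvrh (the field and id below are placeholders):

    <field name="content" type="text" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>

    http://localhost:8983/solr/select?qt=tvrh&q=id:doc1&tv=true&tv.tf=true&tv.df=true&tv.positions=true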
Re: Upload doc and pdf in Solr 3.3.0
http://wiki.apache.org/solr/ExtractingRequestHandler may help.

Regards,
Jayendra

On Thu, Aug 25, 2011 at 3:24 AM, Moinsn felix.wieg...@googlemail.com wrote:
Good morning, I have to set up a Solr system to search in documents like PDF and DOC. My Solr system is running in the meantime, but I can't find a tutorial that tells me what I have to do to get the files into the system. I hope you can help me a bit to pull that off in a simple way. And please excuse my bad English.
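For reference, a typical way to push a PDF or Word file through the extracting request handler (Solr Cell) with curl, assuming the stock /update/extract handler is enabled (id and file name are placeholders):

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@document.pdf"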
Re: Issue in indexing Zip file content with apache-solr-3.3.0
Solr doesn't index the contents of the files inside the zip, just the file names. You can apply these patches:
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

Regards,
Jayendra

On Tue, Aug 23, 2011 at 2:26 AM, Jagdish Kumar jagdish.thapar...@hotmail.com wrote:
Hi all, I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar. Though I am able to index zip files, I get no results if I search for content present in a zip file. Please suggest a possible solution. Thanks and regards, Jagdish
Re: How to start troubleshooting a content extraction issue
You can test the content extraction standalone with the tika-app.jar. To output in text format:

    java -jar tika-app-0.8.jar --text <file_path>

For more options:

    java -jar tika-app-0.8.jar --help

Use the tika-app version matching the one bundled with your Solr build.

Regards,
Jayendra

On Wed, Aug 10, 2011 at 1:53 PM, Tim AtLee timat...@gmail.com wrote:
Hello. I'm a newbie to Solr and Tika and whatnot, so please use simple words for me :P I am running Solr on Tomcat 7 on Windows Server 2008 R2, as the search engine for a Drupal web site. Up until recently, everything has been fine - searching works, faceting works, etc. Recently a user uploaded a 5 MB xltm file, which seems to cause Tomcat to spike in CPU usage and eventually error out. When the documents are submitted to be indexed, the Tomcat process spikes to 100% of one available CPU, with the eventual error in Drupal of "Exception occured sending sites/default/files/nodefiles/533/June 30, 2011.xltm to Solr 0 Status: Communication Error". I am looking for some help in figuring out where to troubleshoot this. I assume it's this file, but I'd like to be sure - so how can I submit this file for content extraction manually to see what happens? Thanks, Tim
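The same check can also be run inside Solr itself: the extracting handler accepts an extractOnly parameter that returns the extracted content without indexing it, which is handy for reproducing the failure in place (the file name is a placeholder):

    curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@June30.xltm"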
Re: Possible bug in FastVectorHighlighter
Try using:

    <str name="hl.tag.pre"><![CDATA[<b>]]></str>
    <str name="hl.tag.post"><![CDATA[</b>]]></str>

Regards,
Jayendra

On Tue, Aug 9, 2011 at 4:46 AM, Massimo Schiavon mschia...@volunia.com wrote:
In my Solr (3.3) configuration I specified these two params:

    <str name="hl.simple.pre"><![CDATA[<b>]]></str>
    <str name="hl.simple.post"><![CDATA[</b>]]></str>

When I do a simple search I correctly obtain highlighted results where matches are enclosed in the right tag. If I do the same request with hl.useFastVectorHighlighter=true in the HTTP query string (or specify the same parameter in the config file), the matches are enclosed in the <em> tag (the default value). Has anyone encountered the same issue?
Re: Is there anyway to sort differently for facet values?
You can give it a try with facet.sort. We had such a requirement - sorting facets in an order determined by another field - and had to resort to a very crude way to get through it: we prepended the facet values with the order in which they had to be displayed and used facet.sort to sort alphabetically, e.g.

    Small  -> 0_Small
    Medium -> 1_Medium
    Large  -> 2_Large
    XL     -> 3_XL

You would need to handle the display part, though. Surely not the best way, but it worked for us.

Regards,
Jayendra

On Thu, Aug 4, 2011 at 4:38 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote:
It could be achieved by creating your own (app-specific) custom comparators for fields defined in schema.xml, with an extra attribute in the field tag to specify the comparator class. But that would require changes in Solr to support. (Not sure if it's feasible; just throwing out an idea.) -param

On 8/4/11 4:29 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
No, it can not. It just sorts alphabetically - actually by raw byte order. No other facet sorting functionality is available, and it would be tricky to implement in a performant way because of the way Lucene works. But it would certainly be useful to me too if someone could figure out a way to do it.

On 8/4/2011 2:43 PM, Way Cool wrote:
Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try it, though. Can it handle the values below in the correct order?

    Under 10
    10 - 20
    20 - 30
    Above 30

Or:

    Small
    Medium
    Large
    XL

My second question: if Solr can't do that for the values above using facet.sort, is there any other way in Solr? Thanks in advance, YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote:
Have you looked at the facet.sort parameter? The index value is what I think you want. Best, Erick

On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote:
Hi guys, is there any way to sort facet values differently? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is that possible in Solr? Thanks, YH
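As a sketch of the display-side handling this trick requires, the numeric sort prefix is stripped before rendering (Java; the prefix format follows the 0_Small example above):

    // Strip the leading "N_" ordering prefix from a facet value for display.
    String display = facetValue.replaceFirst("^\\d+_", "");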
Re: ' invisible ' words
Strange. The only other difference I see is the different configuration of the word delimiter filter, with catenateWords and catenateNumbers differing between index and query time - but that should not affect normal word searches.

As others suggested, you may want to use the same chain for both index and query to start with: begin with a plain tokenizer and then add the filters back one by one.

Regards,
Jayendra

On Wed, Jul 13, 2011 at 11:29 PM, deniz denizdurmu...@gmail.com wrote:
Hi Jayendra, I have changed the order and also removed the line related to synonyms, but the result is still the same: somehow some words are just invisible during my searches.
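A minimal sketch of that starting point - a single analyzer element (no type attribute) applies the identical chain at both index and query time, and filters can then be added back one at a time:

    <fieldType name="text_debug" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>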
Re: ' invisible ' words
Hi Deniz,

The order of the filters at index time and query time is different, e.g. the synonym filter. Do you have a custom synonyms text file that may be causing the issue? It usually works fine if you have the same filter order at index and query time. You can try that out.

Regards,
Jayendra

On Tue, Jul 12, 2011 at 11:19 PM, deniz denizdurmu...@gmail.com wrote:
Nothing changed... the result is still the same... Should I implement my own analyzer or tokenizer for the problem?
Re: Master Slave help
Do you mean the replication happens every time you restart the server? If so, you need to modify the events on which you want replication to happen. Check for the replicateAfter tag and remove the startup option if you don't need it:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a valid value for replicateAfter. -->
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <!-- Create a backup after 'optimize'. Other values can be 'commit', 'startup'.
             It is possible to have multiple entries of this config string.
             Note that this is just for backup; replication does not require this. -->
        <!-- <str name="backupAfter">optimize</str> -->
        <!-- If configuration files need to be replicated, give the names here, separated by commas. -->
        <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
        <!-- The default reservation is 10 secs. See the documentation below.
             Normally you should not need to specify this. -->
        <str name="commitReserveDuration">00:00:10</str>
      </lst>
    </requestHandler>

Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta ro...@in-rev.com wrote:
Hi, I have configured my master and slave servers and everything seems to be running fine; replication completed the first time it ran. But every time I go to the replication link in the admin panel after restarting the server, or at server startup, I notice the replication starting from scratch - or at least the stats show that. What could be wrong? Thanks, Rohit
Re: Hitting the URI limit, how to get around this?
Just a suggestion: if the shards are known, you can add them as default params in the request handler so they are always applied, and the URL would then just need the qt parameter.

The URI limit is browser/container dependent. How are you querying Solr - through a client API, or through a browser? Is it hitting the max header length? Can you use POST instead?

Regards,
Jayendra

On Thu, Jun 2, 2011 at 7:12 PM, JohnRodey timothydd...@yahoo.com wrote:
I have a master Solr instance that I send my requests to; it hosts no documents, it just farms the request out to a large number of shards. All the other Solr instances that host the data contain multiple cores. Therefore my search string looks like http://host:port/solr/select?...shards=nodeA:1234/solr/core01,nodeA:1234/solr/core02,nodeA:1234/solr/core03,... This shard list is pretty long and has finally hit the limit. So my question is: how best to avoid building such a long URI? Is there a way to have multiple tiers, where the master server has a list of servers (nodeA:1234, nodeB:1234, ...) and each of those nodes queries the cores it hosts (nodeA hosts core01, core02, core03, ...)? Thanks!
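A sketch of that suggestion - the shard list moves into solrconfig.xml as a default on a dedicated handler (handler name and hosts are illustrative), so requests shrink to http://host:port/solr/select?qt=distributed&q=...:

    <requestHandler name="distributed" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="shards">nodeA:1234/solr/core01,nodeA:1234/solr/core02,nodeA:1234/solr/core03</str>
      </lst>
    </requestHandler>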
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)
Hi Gary,

I tried the patch on the 3.1 source code (@ http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/) as well and it worked fine.

@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals with the Solr Cell module.

You may want to verify the contents of the results by enabling the stored attribute on the text field, e.g.:

    curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated artifact you can test on.

Regards,
Jayendra

On Fri, May 20, 2011 at 11:15 AM, Gary Taylor g...@inovem.com wrote:
Hello again. Unfortunately, I'm still getting nowhere with this. I have checked out the 3.1 source and applied Jayendra's patches (see below), and it still appears that the contents of the files in the zipfile are not being indexed, only the filenames of those contained files. I'm using a simple curl invocation to test this:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F commit=true -F file=@solr1.zip

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm expecting the contents of those txt files to be extracted from the zip and indexed, but this isn't happening - or at least, I don't get the desired result when I do a query afterwards. I do get a match if I search for either doc1.txt or doc2.txt, but not if I search for a word that appears in their contents. If I index one of the txt files (instead of the zipfile), I can query the content OK, so I'm assuming my query is sensible and matches the field specified in the curl string (i.e. text). I'm also happy that the Solr Cell content extraction is working because I can successfully index PDF, Word, etc. files. In a fit of desperation I have added log.info statements into the files referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those in the log when I submit the zipfile with curl, so I know I'm running those patched files in the build. If anyone can shed any light on what's happening here, I'd be very grateful. Thanks and kind regards, Gary.

On 11/04/2011 11:12, Gary Taylor wrote:
Jayendra, thanks for the info - I've been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try to take a look at the patches and re-test soon. Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1, but it's on my to-do list, as is testing the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on. Thanks and kind regards, Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import Handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these. Regards, Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this on Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

    curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich-text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

--
Gary Taylor
INOVEM
Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import Handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this on Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

    curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich-text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor g...@inovem.com wrote:
Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars. When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading):

    SolrInputDocument[{
      ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]},
      ignored_=ignored_(1.0)={[package-entry, package-entry]},
      ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
      ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
      ignored_stream_size=ignored_stream_size(1.0)={260},
      ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
      ignored_content_type=ignored_content_type(1.0)={application/zip},
      docid=docid(1.0)={74},
      type=type(1.0)={5},
      text=text(1.0)={ doc2.txt doc1.txt }
    }]

So the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the curl string, but that makes no difference. If I specify an invalid stream.type I get an exception response, so I know it's being used. When I send one of those txt files individually to the ExtractingRequestHandler, I get:

    SolrInputDocument[{
      ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]},
      ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
      ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
      ignored_stream_size=ignored_stream_size(1.0)={30},
      ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
      ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
      docid=docid(1.0)={74},
      type=type(1.0)={5},
      text=text(1.0)={ The quick brown fox }
    }]

and we see the file contents in the text field. I'm using the following requestHandler definition in solrconfig.xml:

    <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
    <requestHandler name="/update/extract"
                    class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help me work out why it's only returning the file names and not the file contents when parsing a ZIP file? Thanks and kind regards, Gary.

On 25/01/2011 16:48, Jayendra Patil wrote:
Hi Gary, the latest Solr trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of trunk we worked on had the Tika 0.8 snapshot jars and worked pretty well. Tested again with a sample URL and it works fine:

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need
Re: Solrcore.properties
Can you please attach the other files? It doesn't seem to find the enable.master property, so you may want to check that the properties file exists on the box having issues.

We have the following configuration in the core:

solrconfig.xml - Master/Slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">commit</str>
        <str name="confFiles">solrcore_slave.properties:solrcore.properties,solrconfig.xml,schema.xml</str>
      </lst>
      <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://master_host:port/solr/corename/replication</str>
      </lst>
    </requestHandler>

solrcore.properties - Master:

    enable.master=true
    enable.slave=false

solrcore_slave.properties - Slave:

    enable.master=false
    enable.slave=true

We have default values and separate properties files for master and slave. Replication is enabled for the solrcore.properties file.

Regards,
Jayendra

On Mon, Mar 28, 2011 at 2:06 PM, Ezequiel Calderara ezech...@gmail.com wrote:
Hi all, I'm having problems when deploying Solr on the production machines. I have a master Solr and 3 slaves. The master replicates the schema and the solrconfig to the slaves (this file on the master is named solrconfig_slave.xml). The solrconfig of the slaves has, for example, the ${data.dir} and other values in the solrcore.properties. I think Solr isn't recognizing that file, because I get this error:

    HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed
    information on what may be wrong. If you want solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError> in null
    org.apache.solr.common.SolrException: No system property or default value specified for enable.master
    at org.apache.solr.common.util.DOMUtil.substituteProperty(DOMUtil.java:311)
    ... MORE STACK TRACE INFO ...

But here is the thing: "org.apache.solr.common.SolrException: No system property or default value specified for enable.master". I'm attaching the master schema, the master solrconfig, the solrconfig of the slaves, and the solrcore.properties. If anyone has any info on this I would be more than appreciative! Thanks

--
Ezequiel.
Http://www.ironicnet.com
Re: Solr - multivalue fields - please help
Just a suggestion: you can try using dynamic fields, with the company name (or ID) as a prefix. E.g., for the data:

    Employee ID   Employer   FromDate   ToDate
    21345         IBM        01/01/04   01/01/06
                  MS         01/01/07   01/01/08
                  BT         01/01/09   Present

index it as:

    Employee ID   - 21345
    Employer Name - IBM MS BT   (multivalued field)
    IBM_FROM_DATE - 01/01/04    (dynamic field)
    IBM_TO_DATE   - 01/01/06    (dynamic field)

You should be able to match the results, get the from and to dates for the companies, and handle the rest on the UI side.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 8:24 AM, Sandra sclo...@consultant.com wrote:
Hi everyone, I know that Solr cannot match one value in a multi-valued field with the corresponding value in another multi-valued field; however, my data set appears to be in that form at the moment. With that in mind, does anyone know of any good articles or discussions that have addressed this issue, specifically the alternatives that can easily be done/considered, etc.? The data is of the following format: I have a unique Employee ID field, plus Employer (multi-valued), FromDate (multi-valued) and ToDate (multi-valued). For a given employee ID I am trying to return the relevant data. For example, for an ID of 21345 and employer IBM, return the work dates from and to. Or for the same ID and two work dates, return the company or companies that the ID was associated with, etc.

    Employee ID   Employer   FromDate   ToDate
    21345         IBM        01/01/04   01/01/06
                  MS         01/01/07   01/01/08
                  BT         01/01/09   Present

Any suggestions/comments/ideas/articles much appreciated... Thanks, S.
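A sketch of the schema side of this suggestion; the wildcard dynamic field declarations mean no per-employer field needs to be declared (string is used here because the sample dates are not in Solr's ISO-8601 date format):

    <dynamicField name="*_FROM_DATE" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_TO_DATE"   type="string" indexed="true" stored="true"/>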
Re: Solr coding
Why not just add an extra field for the user to each document in the index? You could then easily filter the results on the user field and show only the documents submitted by that user.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 9:20 AM, satya swaroop satya.yada...@gmail.com wrote:
Hi all, for my project requirement I need to keep searches of files private, so I need to modify the Solr code. For example, if there are 5 users and each user indexes some files:

    user1 - java1, c1, sap1
    user2 - java2, c2, sap2
    user3 - java3, c3, sap3
    user4 - java4, c4, sap4
    user5 - java5, c5, sap5

and user2 searches for the keyword "java", then only the file java2 should be displayed, not the other files. To do this filtering inside Solr itself, may I know where to modify the code? I will access a database to check the user's indexed files and then filter the result. I don't have any cores; I indexed all files into a single index. Regards, satya
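A sketch of the filtering this suggests, with no Solr code changes needed - each document is indexed with the submitting user (the field name user is illustrative) and every search is restricted with a filter query:

    http://localhost:8983/solr/select?q=java&fq=user:user2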
Re: Solr coding
In that case, you may want to store the groups that have access to the document in a multivalued field. A filter query on the user's groups should then filter the results as you expect.

You may also want to check Apache ManifoldCF, as suggested by Szott.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 9:46 AM, satya swaroop satya.yada...@gmail.com wrote:
Hi Jayendra, I forgot to mention that the result also depends on the group of the user. It is somewhat complex, so I didn't mention it before; let me explain the exact requirement:

    user1, group1         - java1, c1, sap1
    user2, group2         - java2, c2, sap2
    user3, group1, group3 - java3, c3, sap3
    user4, group3         - java4, c4, sap4
    user5, group3         - java5, c5, sap5

"user1, group1" means user1 belongs to group1. Here the filter includes the group too: if, for example, user1 searches for "java", then the results should show java1 and java3, since the java3 file is accessible to all users in group1. So I thought of editing the code... Thanks, satya
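Continuing the sketch from the previous reply (field and value names are illustrative): a multivalued groups field in schema.xml, and a filter query expanded to the searching user's groups:

    <field name="groups" type="string" indexed="true" stored="true" multiValued="true"/>

    http://localhost:8983/solr/select?q=java&fq=groups:(group1 OR group3)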
Re: Logic operator with dismax
Dismax does not support boolean queries; you may try Extended Dismax (edismax) for boolean support: https://issues.apache.org/jira/browse/SOLR-1553

Regards,
Jayendra

On Mon, Mar 21, 2011 at 8:24 AM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote:
Hello, the Dismax search handler doesn't have the concept of a logical operator in terms of OR/AND, but rather uses a feature called Min-Should-Match (mm). This parameter specifies the absolute number or percentage of the entered terms that need to match. For an OR-like effect you can specify mm=0%, and for an AND-like effect mm=100% should work. More information can be found here:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

On 21 March 2011 11:46, Gastone Penzo gastone.pe...@gmail.com wrote:
Hi, I have a problem with the logical OR operator in a dismax query. Some days ago the query worked well; now it returns nothing (0 documents). The query is:

    http://localhost:8983/solr/select/?q=1324 OR 4322 OR 2324 OR hello+world&defType=dismax&qf=code%20title

The schema has the fields code and title. I want to search for docs with "hello world" in the title, plus the docs with the codes 1324, 4322, 2324 (even if they don't have "hello world" in the title). The result is that the query returns the docs with these codes AND "hello world" in the title (logical AND, not OR). The default operator in the schema is OR. What happened? Thank you.

--
Gastone Penzo
www.solr-italia.it
The first Italian blog dedicated to Apache Solr
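Once a Solr version that bundles edismax (or the SOLR-1553 patch) is in place, the same query can be issued with the extended parser, which does honor boolean operators (a sketch using the fields from the question):

    http://localhost:8983/solr/select?defType=edismax&qf=code%20title&q=1324 OR 4322 OR 2324 OR "hello world"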
Re: SOLR DIH importing MySQL text column as a BLOB
Hi Kaushik,

If the field is being treated as a blob, you can try the FieldStreamDataSource mapping, which handles blob objects and extracts their contents. This feature is available only from Solr 3.1 onwards, I believe.

http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/FieldStreamDataSource.html

Regards,
Jayendra

On Tue, Mar 15, 2011 at 11:57 PM, Kaushik Chakraborty kaych...@gmail.com wrote:
I have a column for posts in MySQL of type text. I've tried the corresponding field types for it in the Solr schema.xml, e.g. string, text, text_ws, but whenever I import it using the DIH it gets imported as a BLOB object. I checked: this happens only for columns of type text and not for varchar (those get indexed as strings). Hence the posts field is not searchable. I found out about this issue, after repeated search failures, when I did a *:* query on Solr. A sample response:

    <result name="response" numFound="223" start="0" maxScore="1.0">
      <doc>
        <float name="score">1.0</float>
        <str name="solr_post_bio">[B@10a33ce2</str>
        <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
        <str name="solr_post_email">test.acco...@gmail.com</str>
        <str name="solr_post_first_name">Test</str>
        <str name="solr_post_last_name">Account</str>
        <str name="solr_post_message">[B@2c93c4f1</str>
        <str name="solr_post_status_message_id">1</str>
      </doc>
    </result>

The data-config.xml:

    <document>
      <entity name="posts" dataSource="jdbc" query="select p.person_id as solr_post_person_id,
          pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name,
          u.email as solr_post_email, p.message as solr_post_message,
          p.id as solr_post_status_message_id, p.created_at as solr_post_created_at,
          pr.bio as solr_post_bio
          from posts p, users u, profiles pr
          where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
        <field column="solr_post_person_id"/>
        <field column="solr_post_first_name"/>
        <field column="solr_post_last_name"/>
        <field column="solr_post_email"/>
        <field column="solr_post_message"/>
        <field column="solr_post_status_message_id"/>
        <field column="solr_post_created_at"/>
        <field column="solr_post_bio"/>
      </entity>
    </document>

The schema.xml:

    <fields>
      <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true"/>
      <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true"/>
      <field name="solr_post_bio" type="text" indexed="false" stored="true"/>
      <field name="solr_post_first_name" type="string" indexed="false" stored="true"/>
      <field name="solr_post_last_name" type="string" indexed="false" stored="true"/>
      <field name="solr_post_email" type="string" indexed="false" stored="true"/>
      <field name="solr_post_created_at" type="date" indexed="false" stored="true"/>
    </fields>
    <uniqueKey>solr_post_status_message_id</uniqueKey>
    <defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
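A sketch of the FieldStreamDataSource wiring the reply refers to: a nested entity streams the blob column from the outer row (via dataField) and a processor converts it to text. Names follow the question where possible; the rest is illustrative:

    <dataSource name="jdbc" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://..."/>
    <dataSource name="fieldSource" type="FieldStreamDataSource"/>
    <document>
      <entity name="posts" dataSource="jdbc" query="select id, message from posts">
        <entity name="message" dataSource="fieldSource" processor="TikaEntityProcessor"
                dataField="posts.message" format="text">
          <field column="text" name="solr_post_message"/>
        </entity>
      </entity>
    </document>

(Alternatively, for MySQL text columns returned as byte arrays, setting convertType="true" on the JdbcDataSource is often enough to have them read as strings.)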
Re: docBoost
You can use the ScriptTransformer to perform the boost calculation and addition:
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

    <dataConfig>
      <script><![CDATA[
        function f1(row) {
          // Add boost
          row.put('$docBoost', 1.5);
          return row;
        }
      ]]></script>
      <document>
        <entity name="e" pk="id" transformer="script:f1" query="select * from X"/>
      </document>
    </dataConfig>

Regards,
Jayendra

On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb brian.l...@journalexperts.com wrote:
Anyone have any clue on this one?

On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb brian.l...@journalexperts.com wrote:
Hi all, I am using dataimport to create my index and I want to use docBoost to assign higher weights to certain docs. I understand the concept behind docBoost, but I haven't been able to find an example anywhere that shows how to implement it. Assume the following config file:

    <document>
      <entity name="animal" dataSource="animals" pk="id" query="SELECT * FROM animals">
        <field column="id" name="id"/>
        <field column="genus" name="genus"/>
        <field column="species" name="species"/>
        <entity name="boosters" dataSource="boosts"
                query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
          <field column="boost_score" name="boost_score"/>
        </entity>
      </entity>
    </document>

How do I add in a docBoost score? The boost score is currently in a separate table, as shown above.
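Tying the two together, the script can read the joined boost_score column off the row instead of hard-coding a value (a sketch; it assumes boost_score ends up on the row being transformed):

    function addBoost(row) {
      var score = row.get('boost_score');
      if (score != null) {
        row.put('$docBoost', score);
      }
      return row;
    }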
Re: Same index is ranking differently on 2 machines
queryNorm is just a normalizing factor and has the same value across all the results of a query; it only makes the scores comparable. So even if it varies between environments, you should not be worried about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

Definition: queryNorm(q) is just a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we see a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. I have a local machine with Solr and a version deployed on a production server. My local machine's Solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine, so that my local machine and production have access to the selfsame data. I executed a total full-import on both. Still, I see a different position for this document, which should surely rank in the same location, all else being equal. I ran a diff on the debugQuery output to see how the scores were being computed (see the appendix at the foot of this email). As far as I can tell, every single query normalisation block of the debug output is marginally different, e.g.:

    - 0.021368012 = queryNorm (local)
    + 0.009944122 = queryNorm (production)

which leads to a final score of -2.28 versus +1.06 - enough to skew the results from correct to incorrect (in terms of what we expect to see):

    - 2.286596  (local)
    + 1.0651637 (production)

I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing? Thank you for your time, Allistair.

APPENDIX - debugQuery=on DIFF

    --- untitled
    +++ (clipboard)
    @@ -1,51 +1,49 @@
    -<str name="L12411p">
    +<str name="L12411">
    -2.286596 = (MATCH) sum of:
    -  1.6891675 = (MATCH) sum of:
    -    1.3198489 = (MATCH) max plus 0.01 times others of:
    -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
    -        0.011795795 = queryWeight(text:dubai^0.1), product of:
    -          0.1 = boost
    +1.0651637 = (MATCH) sum of:
    +  0.7871359 = (MATCH) sum of:
    +    0.6151879 = (MATCH) max plus 0.01 times others of:
    +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
    +        0.05489459 = queryWeight(text:dubai), product of:
               5.520305 = idf(docFreq=65, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
               1.4142135 = tf(termFreq(text:dubai)=2)
               5.520305 = idf(docFreq=65, maxDocs=6063)
               0.25 = fieldNorm(field=text, doc=1551)
    -      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
    -        0.32609802 = queryWeight(profile:dubai^2.0), product of:
    +      0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
    +        0.15175761 = queryWeight(profile:dubai^2.0), product of:
               2.0 = boost
               7.6305184 = idf(docFreq=7, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
               1.4142135 = tf(termFreq(profile:dubai)=2)
               7.6305184 = idf(docFreq=7, maxDocs=6063)
               0.375 = fieldNorm(field=profile, doc=1551)
    -    0.36931866 = (MATCH) max plus 0.01 times others of:
    -      0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
    -        0.003954251 = queryWeight(text:product^0.1), product of:
    -          0.1 = boost
    +    0.17194802 = (MATCH) max plus 0.01 times others of:
    +      0.00851347 = (MATCH) weight(text:product in 1551), product of:
    +        0.018402064 = queryWeight(text:product), product of:
               1.8505468 = idf(docFreq=2589, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of:
               1.0 = tf(termFreq(text:product)=1)
               1.8505468 = idf(docFreq=2589, maxDocs=6063)
               0.25 = fieldNorm(field=text, doc=1551)
    -      0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of:
    -        0.1725098 =
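For reference, the factor in question is computed (per the documentation for Lucene's DefaultSimilarity) as

    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    sumOfSquaredWeights = boost(q)^2 * sum over terms t of (idf(t) * boost(t))^2

so any difference in per-term boosts or idf between the two environments changes every queryNorm value - which is consistent with the boost mismatch identified in the follow-up below.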
Re: Same index is ranking differently on 2 machines
Are you sure you have the same config? The boost seems different for the text field - text:dubai^0.1 locally versus text:dubai in production:

    -2.286596 = (MATCH) sum of:
    -  1.6891675 = (MATCH) sum of:
    -    1.3198489 = (MATCH) max plus 0.01 times others of:
    -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
    -        0.011795795 = queryWeight(text:dubai^0.1), product of:
    -          0.1 = boost
    +1.0651637 = (MATCH) sum of:
    +  0.7871359 = (MATCH) sum of:
    +    0.6151879 = (MATCH) max plus 0.01 times others of:
    +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
    +        0.05489459 = queryWeight(text:dubai), product of:

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Thanks. Good to know, but even so my problem remains - the end score should not be different, and it is causing a dramatically different ranking of a document (3 versus 7 is dramatic for my client). This must be down to the scoring debug differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:
queryNorm is just a normalizing factor and has the same value across all the results of a query; it only makes the scores comparable. So even if it varies between environments, you should not be worried about it. queryNorm(q) is just a normalizing factor used to make scores between queries comparable; it does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we see a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. My local machine's Solr and the production version are both checked out from our project's SVN trunk; they are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine, and executed a total full-import on both. Still, I see a different position for this document, which should surely rank in the same location, all else being equal. As far as I can tell, every single query normalisation block of the debug output is marginally different, which leads to a final score of -2.28 versus +1.06 - enough to skew the results from correct to incorrect. I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing? Thank you for your time, Allistair.
Solr Cell DataImport Tika handler broken - fails to index Zip file contents
Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) fail to index zip file contents again; they just index the file names. This issue was addressed some time back, late last year, but seems to have reappeared with the latest code.

I had raised a JIRA for the Data Import Handler part with the patch and the test case: https://issues.apache.org/jira/browse/SOLR-2332. The same fix is needed for Solr Cell as well. I can raise a JIRA and provide the patch for that too, if the above patch seems good enough.

Regards,
Jayendra
Re: logical relation among filter queries
You can use boolean operators in the filter query, e.g.

    fq=rating:(PG-13 OR R)

Regards,
Jayendra

On Mon, Mar 7, 2011 at 9:25 PM, cyang2010 ysxsu...@hotmail.com wrote:
I wonder what the logical relation among filter queries is; I can't find much documentation on filter queries. For example, I want to find all titles that are either PG-13 or R through a filter query. The following query won't give me any results back, so I suppose that by default it is an intersection of the filter query results?

    fq=rating:PG-13&fq=rating:R&q=*:*

How do I change it to a union of the filter query results? Thanks.
Re: adding a document using curl
If you are using the ExtractingRequestHandler, you can also try using stream.file or stream.url, e.g.

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/777045.zip&literal.id=777045&literal.title=Test&commit=true"

A more detailed explanation is available @ http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

The literal. prefix maps the passed attributes to normal fields, and the content extracted from the document is stored in the text field by default.

Regards,
Jayendra

On Thu, Mar 3, 2011 at 7:16 AM, Gary Taylor g...@inovem.com wrote:
As an example, I run this in the same directory as the msword1.doc file:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&literal.type=5" -F file=@msword1.doc

The type literal is just part of my schema. Gary.

On 03/03/2011 11:45, Ken Foskey wrote:
On Thu, 2011-03-03 at 12:36 +0100, Markus Jelsma wrote:
Here's a complete example:
http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL

I should have been clearer: a rich-text document. XML I can make work, and a script is in the example docs folder. http://wiki.apache.org/solr/ExtractingRequestHandler - I also read the Solr 1.4 book and tried the samples in there, but could not make them work. Ta
Re: solr different sizes on master and slave
Hi Mike,

There was an issue with the snappuller wherein it fails to clean up the old index directories on the slave side: https://issues.apache.org/jira/browse/SOLR-2156. The patch can be applied to fix the issue.

You can also delete the old index directories yourself, except for the current one, which is named in index.properties.

Regards,
Jayendra

On Tue, Mar 1, 2011 at 4:27 PM, Mike Franon kongfra...@gmail.com wrote:
OK, doing some more research I noticed that the slave keeps multiple index folders, for example:

    index
    index.20110204010900
    index.20110204013355
    index.20110218125400

and there is an index.properties that shows which index it is using. I am just curious why it keeps multiple copies. Is there a setting somewhere I can change to keep only one copy, so as not to lose space? Thanks

On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon kongfra...@gmail.com wrote:
No pending commits. It looks like there are almost two copies of the index on the master; not sure how that happened.

On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma markus.jel...@openindex.io wrote:
Are there pending commits on the master?

I was curious why the size would be dramatically different even though the index versions are the same. One is 1.2 GB, and on the slave it is 512 MB. I would think they should both be the same size, no? Thanks
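As a sketch of the manual cleanup (paths are placeholders - always check index.properties first to see which directory is live):

    cat /var/solr/data/index.properties
    # e.g. index=index.20110218125400
    rm -rf /var/solr/data/index.20110204010900 /var/solr/data/index.20110204013355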
Re: Groupped results
Hi Rok,

If I understood the use case correctly, grouping of results is possible in Solr: http://wiki.apache.org/solr/FieldCollapsing

You could create a new field holding the combination that defines the group and use the field collapsing feature to group the results:

    Id   Type1   Type2   Title   Group
    1    a       b       xfg     ab
    2    a       c       abd     ac
    3    a       d       thm     ad
    4    b       a       efd     ba
    5    b       b       ikj     bb
    6    b       c       azd     bc

It also provides sorting and group-sorting features.

Regards,
Jayendra

On Wed, Mar 2, 2011 at 6:37 AM, Rok Rejc rokrej...@gmail.com wrote:
I have an index with a number of documents. For example (this example is representative and contains many other fields):

    Id   Type1   Type2   Title
    1    a       b       xfg
    2    a       c       abd
    3    a       d       thm
    4    b       a       efd
    5    b       b       ikj
    6    b       c       azd
    ...

I want to query the index on a number of fields (not a problem), but I want the results ordered in groups, and within each group ordered alphabetically by Title. The groups are not fixed; they are created at runtime. For example:

    Group 1: documents with Type1=b and Type2=b
    Group 2: documents with Type1=a and Type2=b
    Group 3: documents with Type1=b and Type2=a
    Group 4: documents with Type1=b and Type2=c
    ...

So I want to retrieve results ordered by group (1, 2, 3, 4) and after that alphabetically by Title. I think I should create a query where each group is separated by an OR operator and boost each group with an appropriate factor, then order the results by this factor and Title. Is this possible? Any suggestions are appreciated. Many thanks, Rok
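A sketch of that suggestion: index a combined field (here called typeGroup, e.g. the concatenation of Type1 and Type2 as in the Group column above) and group on it, sorting documents inside each group by title:

    http://localhost:8983/solr/select?q=*:*&group=true&group.field=typeGroup&group.sort=Title asc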
Re: solr score issue
Check the "Need help in understanding output of searcher.explain() function" thread: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201008.mbox/%3CAANLkTi=m9a1guhrahpeyqaxhu9gta9fjbnr7-8-zi...@mail.gmail.com%3E

Regards, Jayendra

On Fri, Feb 25, 2011 at 6:57 AM, Bagesh Sharma mail.bag...@gmail.com wrote: Hi sir, can anyone explain to me how this score is being calculated? I am searching here for "software engineer" using the dismax handler. Total documents indexed are 477 and the query returns 28 results. The query is like this: q=software+engineer&fq=location%3Adelhi The dismax setting is:

    <str name="qf">alltext title^2 functional_role^1</str>
    <str name="pf">body^100</str>

Here the alltext field is made by copying all fields. The body field contains the detail of the job. I am unable to understand how these scores have been calculated. Where do I start calculating the score from, and what are the default scores for any term match?

    <str name="20080604/3eb9a7b30131a782a0c0a0e2cdb2b6b8.html">
    0.5901718 = (MATCH) sum of:
      0.0032821721 = (MATCH) sum of:
        0.0026574256 = (MATCH) max plus 0.1 times others of:
          0.0026574256 = (MATCH) weight(alltext:softwar in 339), product of:
            0.0067262817 = queryWeight(alltext:softwar), product of:
              3.6121683 = idf(docFreq=34, maxDocs=477)
              0.0018621174 = queryNorm
            0.39508092 = (MATCH) fieldWeight(alltext:softwar in 339), product of:
              1.0 = tf(termFreq(alltext:softwar)=1)
              3.6121683 = idf(docFreq=34, maxDocs=477)
              0.109375 = fieldNorm(field=alltext, doc=339)
        6.2474643E-4 = (MATCH) max plus 0.1 times others of:
          6.2474643E-4 = (MATCH) weight(alltext:engin in 339), product of:
            0.0032613424 = queryWeight(alltext:engin), product of:
              1.7514161 = idf(docFreq=224, maxDocs=477)
              0.0018621174 = queryNorm
            0.19156113 = (MATCH) fieldWeight(alltext:engin in 339), product of:
              1.0 = tf(termFreq(alltext:engin)=1)
              1.7514161 = idf(docFreq=224, maxDocs=477)
              0.109375 = fieldNorm(field=alltext, doc=339)
      0.5868896 = weight(body:"softwar engin"^100.0 in 339), product of:
        0.9995919 = queryWeight(body:"softwar engin"^100.0), product of:
          100.0 = boost
          5.3680387 = idf(body: softwar=34 engin=223)
          0.0018621174 = queryNorm
        0.58712924 = fieldWeight(body:"softwar engin" in 339), product of:
          1.0 = tf(phraseFreq=1.0)
          5.3680387 = idf(body: softwar=34 engin=223)
          0.109375 = fieldNorm(field=body, doc=339)
    </str>

Please suggest. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-score-issue-tp2574680p2574680.html Sent from the Solr - User mailing list archive at Nabble.com.
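To get this kind of explain output from a client rather than the browser, debugQuery can be switched on via SolrJ. A small sketch, assuming a SolrJ 1.4-era client and that the handler accepts dismax as in the thread (URL and values are just the thread's examples):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExplainDump {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("software engineer");
            q.set("defType", "dismax");
            q.addFilterQuery("location:delhi");
            q.set("debugQuery", "true"); // adds the per-document explain section to the response
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getDebugMap().get("explain"));
        }
    }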
Re: query slop issue
qs is only the amount of slop on phrase queries explicitly specified in q, applied to the qf fields. So only if the search q were "water treatment plant" (as a quoted phrase) would qs come into the picture. Slop is the maximum allowable positional distance between terms for them still to be considered a match; the distance is the number of positional moves of terms needed to reconstruct the phrase in the same order. So with qs=1 you are allowed only one positional move to recreate the exact phrase. You may also want to check the pf and ps params for dismax.

Regards, Jayendra

On Thu, Feb 24, 2011 at 8:31 AM, Bagesh Sharma mail.bag...@gmail.com wrote: Hi all, I have a search string q=water+treatment+plant and I am using the dismax request handler where I have qs=1. How will the processing be done, meaning within how many words should water, treatment or plant occur to come into the result set? -- View this message in context: http://lucene.472066.n3.nabble.com/query-slop-issue-tp2567418p2567418.html Sent from the Solr - User mailing list archive at Nabble.com.
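For completeness, a minimal SolrJ sketch of a dismax query where qs actually applies, i.e. the q itself is a quoted phrase (the alltext field is borrowed from the earlier thread and is an assumption):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class QuerySlop {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("\"water treatment plant\""); // qs only applies to explicit phrases
            q.set("defType", "dismax");
            q.set("qf", "alltext");
            q.set("qs", "1"); // one positional move allowed when matching the phrase
            System.out.println(solr.query(q).getResults().getNumFound());
        }
    }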
Re: Problem in full query searching
With the dismax or extended dismax parser you should be able to achieve this.

Dismax: qf, qs, pf and ps should give you exact control over the fields and boosts.
Extended dismax: in addition to qf, qs, pf and ps, you have pf2 and pf3 for the two- and three-word shingles.

As Grijesh mentioned, use more weight for phrase or proximity matches (see the sketch after this message).

Regards, Jayendra

On Thu, Feb 24, 2011 at 4:03 AM, Grijesh pintu.grij...@gmail.com wrote: Try to configure more weight on the ps and pf parameters of the dismax request handler to boost phrase-matching documents. Or, if you do not want to consider the term frequency, then use omitTermFreqAndPositions=true in the field definition - Thanx: Grijesh http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-full-query-searching-tp2566054p2566230.html Sent from the Solr - User mailing list archive at Nabble.com.
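A quick sketch of what such an extended dismax request might look like from SolrJ; the field names and boosts are made up, and edismax assumes a Solr version that ships it (3.1+/trunk at the time):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class ShinglePhraseBoost {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("black jacket red cap");
            q.set("defType", "edismax");
            q.set("qf", "title body"); // fields searched term by term
            q.set("pf", "body^10");    // whole-query phrase boost
            q.set("pf2", "body^5");    // two-word shingles: "black jacket", "jacket red", ...
            q.set("pf3", "body^3");    // three-word shingles
            System.out.println(solr.query(q).getResults());
        }
    }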
Re: Index MS office
http://wiki.apache.org/solr/ExtractingRequestHandler

Regards, Jayendra

On Wed, Feb 2, 2011 at 10:49 AM, Thumuluri, Sai sai.thumul...@verizonwireless.com wrote: Good Morning, I am planning to get started on indexing MS Office documents using Apache Solr - can someone please direct me to where I should start? Thanks, Sai Thumuluri
Re: configure httpclient to access solr with user credential on third party host
This should help:

    HttpClient client = new HttpClient();
    client.getParams().setAuthenticationPreemptive(true);
    AuthScope scope = new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT);
    client.getState().setCredentials(scope, new UsernamePasswordCredentials(user, password));

Regards, Jayendra

On Thu, Jan 27, 2011 at 4:47 PM, Darniz rnizamud...@edmunds.com wrote: thanks, exactly. I asked my domain hosting provider and he provided me with some other port. I am wondering, can I specify credentials without the port? I mean, when I open the browser and type www.mydomainmame/solr I get the tomcat auth login screen. In the same way, can I configure the http client so that I don't have to specify the port? Thanks darniz -- View this message in context: http://lucene.472066.n3.nabble.com/configure-httpclient-to-access-solr-with-user-credential-on-third-party-host-tp2360364p2364190.html Sent from the Solr - User mailing list archive at Nabble.com.
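To wire that authenticated client into SolrJ, the CommonsHttpSolrServer constructor accepts a preconfigured HttpClient. A sketch with a hypothetical host and credentials; note the port simply lives in the URL, so omitting it means the protocol default (80 for http):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.UsernamePasswordCredentials;
    import org.apache.commons.httpclient.auth.AuthScope;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class AuthenticatedSolr {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            client.getParams().setAuthenticationPreemptive(true);
            client.getState().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT),
                    new UsernamePasswordCredentials("user", "password")); // hypothetical credentials
            // no explicit port in the URL: http defaults to port 80
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://www.example.com/solr", client);
            System.out.println(solr.ping().getStatus());
        }
    }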
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Hi Gary,

The latest Solr trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of trunk we worked upon had the Tika 0.8 snapshot jars and worked pretty well. Tested again with a sample url and it works fine:

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need to drill down to the Tika jars and the apache-solr-cell-4.0-dev.jar used for rich document indexing.

Regards, Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor g...@inovem.com wrote: OK, got past the schema.xml problem, but now I'm back to square one. I can index the contents of binary files (Word, PDF etc...), as well as text files, but it won't index the content of files inside a zip. As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@doc1.txt

and commit, Solr will index the contents and searches will match. If I zip those two files up into solr1.zip, and index that using:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@solr1.zip

and commit, the file names are indexed, but not their contents. I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar - it outputs both the filenames and contents. Should I be able to index the contents of files stored in a zip by using extract? Thanks and kind regards, Gary.

On 25/01/2011 15:32, Gary Taylor wrote: Thanks Erlend. Not used SVN before, but have managed to download and build the latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked the CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML? Thanks and kind regards, Gary.

On 25/01/2011 14:17, Erlend Garåsen wrote: On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by content I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
Re: StopFilterFactory and qf containing some fields that use it and some that do not
Have used edismax and stopword filters as well, but usually with the fq parameter, e.g. fq=title:"the life", and never had any issues. Can you turn on debugQuery and check what query is formed for all the combinations you mentioned?

Regards, Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.com wrote: I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the qf parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean:

- I have 2 fields indexed: Title is "textStemmed", which includes StopFilterFactory (see below). Contributor is "textSimple", which does not include StopFilterFactory (see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

It seems as if the stop words are not being stripped from the query because qf contains a field that doesn't use StopFilterFactory. I did testing with combining stemmed fields with non-stemmed fields in qf and it seems as if stemming gets applied regardless. But stop words do not. Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated.

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

    <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
Re: solr wildcard queries and analyzers
Had the same issues with international characters and wildcard searches. One workaround we implemented was to index the field both with and without the ASCIIFoldingFilterFactory: you would have the original field and one with the English equivalents, to be used during searching. Wildcard searches with either the English-equivalent or the international terms would then match one of those. Also, lower-case the search terms if you are using a lowercase filter during indexing.

Regards, Jayendra

On Wed, Jan 12, 2011 at 7:46 AM, Kári Hreinsson k...@gagnavarslan.is wrote: Have you made any progress? Since the AnalyzingQueryParser doesn't inherit from QParserPlugin, solr doesn't want to use it, but I guess we could implement a similar parser that does inherit from QParserPlugin? Switching parser seems to be what is needed? Has really no one solved this before? - Kári

- Original Message - From: Matti Oinas matti.oi...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, 11 January, 2011 12:47:52 PM Subject: Re: solr wildcard queries and analyzers

This might be the solution: http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

2011/1/11 Matti Oinas matti.oi...@gmail.com: Sorry, the message was not meant to be sent here. We are struggling with the same problem here.

2011/1/11 Matti Oinas matti.oi...@gmail.com: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers "On wildcard and fuzzy searches, no text analysis is performed on the search word."

2011/1/11 Kári Hreinsson k...@gagnavarslan.is: Hi, I am having a problem with the fact that no text analysis is performed on wildcard queries. I have the following field type (a bit simplified):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

My problem has to do with Icelandic characters. When I index a document with a text field including the word "sjálfsögðu", it gets indexed as "sjalfsogdu" (because of the ASCIIFoldingFilterFactory, which replaces the Icelandic characters with their English equivalents). Then, when I search (without a wildcard) for "sjálfsögðu" or "sjalfsogdu", I get that document as a result. This is convenient since it enables people to search without using accented characters and yet get the results they want (e.g. if they are working on computers with English keyboards). However, this all falls apart when using wildcard searches: then the search string isn't passed through the filters, and even if I search for "sjálf*" I don't get any results, because the index doesn't contain the original words (I do get results if I search for "sjalf*"). I know people have been having a similar problem with the case sensitivity of wildcard queries, and most often the solution seems to be to lowercase the string before passing it on to solr, which is not exactly an optimal solution (yet a simple one in that case). The Icelandic characters complicate things a bit, and applying the same solution (doing the lowercasing and character mapping) in my application seems like unnecessary duplication of code already part of solr, not to mention complication of my application and possible maintenance down the road. Is there any way around this? How are people solving this? Is there a way to apply the filters to wildcard queries?
I guess removing the ASCIIFoldingFilterFactory is the simplest solution but this normalization (of the text done by the filter) is often very useful. I hope I'm not overlooking some obvious explanation. :/ Thanks in advance, Kári Hreinsson
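The client-side half of the workaround (normalizing the wildcard term before sending it to Solr) can be approximated with the plain JDK. This is only an approximation of what ASCIIFoldingFilter does, not the filter's exact mapping table; letters without a Unicode decomposition (ð, þ, ...) need explicit mappings:

    import java.text.Normalizer;

    public class WildcardFolder {
        // approximate stand-in for LowerCaseFilter + ASCIIFoldingFilter,
        // applied to the wildcard term before it is sent to Solr
        static String foldForWildcard(String term) {
            String decomposed = Normalizer.normalize(term.toLowerCase(), Normalizer.Form.NFD);
            String stripped = decomposed.replaceAll("\\p{M}", ""); // removes combining accents: á -> a, ö -> o
            return stripped.replace("ð", "d").replace("þ", "th");  // no decomposition, so map explicitly
        }

        public static void main(String[] args) {
            System.out.println(foldForWildcard("sjálf*")); // prints sjalf*
        }
    }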
Re: Can't find source or jar for Solr class JaspellTernarySearchTrie
Check out and build the code from https://svn.apache.org/repos/asf/lucene/dev/trunk/

The class is at https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.java

Regards, Jayendra

On Wed, Jan 12, 2011 at 8:46 AM, Larry White ljw1...@gmail.com wrote: Hi, I'm trying to find the source code for the class JaspellTernarySearchTrie. It's supposed to be used for spelling suggestions. It's referenced in the javadoc: http://lucene.apache.org/solr/api/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.html I realize this is a dumb question, but I've been looking through the downloads for several hours. I can't actually find the package org/apache/solr/spelling/suggest/ that it's supposed to be under. So if you would be so kind: What jar is it compiled into? Where is the source in the downloaded source tree? thanks.
Re: Failover setup (is this a bad idea)
Rather, use a master and multiple slaves, with the master only being used for writes and the slaves used for reads. Master-to-slave replication is easily configurable. Two Solr instances sharing the same index is not at all a good idea if both write to the same index.

Regards, Jayendra

On Tue, Nov 30, 2010 at 7:13 AM, Keith Pope keith.p...@inflightproductions.com wrote: Hi, I have a windows cluster that I would like to install Solr onto; there are two nodes that provide basic failover. I was thinking of this setup:

Tomcat installed as a win service
Two solr instances sharing the same index

The second instance would take over when the first fails, so you should never get two writes/reads at once. Is this a bad idea? Would I end up corrupting my index? Thx Keith
Re: Extracting and indexing content from multiple binary files into a single Solr document
The way we implemented the same scenario was to zip all the attachments into a single zip file, which can be passed to the ExtractingRequestHandler for indexing and included as part of a single Solr document.

Regards, Jayendra

On Wed, Nov 17, 2010 at 6:27 AM, Gary Taylor g...@inovem.com wrote: Hi, We're trying to use Solr to replace a custom Lucene server. One requirement we have is to be able to index the content of multiple binary files into a single Solr document. For example, a uniquely named object in our app can have multiple attached files (eg. Word, PDF etc.), and we want to index (but not store) the contents of those files in the single Solr doc for that named object. At the moment, we're issuing HTTP requests direct from ColdFusion and using the /update/extract servlet, but can only specify a single file on each request. Is the best way to achieve this to extend ExtractingRequestHandler to allow multiple binary files and thus specify our own RequestHandler, or would using the SolrJ interface directly be a better bet, or am I missing something fundamental? Thanks and regards, Gary.
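A rough sketch of that flow in SolrJ 1.4-era code: zip the attachments, then post the archive to /update/extract as one document. The file names, the docid literal and the server URL are all hypothetical:

    import java.io.*;
    import java.util.zip.*;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ZipAndExtract {
        public static void main(String[] args) throws Exception {
            // 1. zip the attachments of one logical object into a single archive
            File zip = new File("attachments-74.zip");
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip));
            for (String name : new String[]{"spec.doc", "contract.pdf"}) {
                out.putNextEntry(new ZipEntry(name));
                FileInputStream in = new FileInputStream(name);
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
                in.close();
                out.closeEntry();
            }
            out.close();

            // 2. send the archive to /update/extract as one Solr document
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(zip);
            req.setParam("literal.docid", "74");
            req.setParam("commit", "true");
            solr.request(req);
        }
    }

Streaming the zip over http is slower than pointing stream.file at it (see the ContentStreamUpdateRequest thread elsewhere in this archive), but for a freshly built archive it is usually the simpler option.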
basic authentication for schema.url
We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able to access the document. The implementation with stream.file would mean downloading the files, which would cause duplication, whereas stream.body would have indexing performance issues with the huge amount of data being transferred over the network.

The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication, but it could easily be supported by:

1. Passing an additional authentication parameter, e.g. stream.url.auth, with the encoded authentication value - SolrRequestParsers
2. Setting the Authorization request property on the connection - ContentStreamBase.URLStream:

    this.conn.setRequestProperty("Authorization", "Basic " + encodedAuthentication);

Any suggestions?

Regards, Jayendra
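As a standalone illustration of point 2 (not the actual patch), opening an authenticated URL stream would look roughly like this; the Base64 encoder from commons-codec is an assumed dependency:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import org.apache.commons.codec.binary.Base64;

    public class AuthenticatedUrlStream {
        public static InputStream open(String url, String user, String password) throws Exception {
            URLConnection conn = new URL(url).openConnection();
            String encoded = new String(Base64.encodeBase64((user + ":" + password).getBytes("UTF-8")));
            conn.setRequestProperty("Authorization", "Basic " + encoded); // the line proposed for URLStream
            return conn.getInputStream();
        }
    }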
Re: basic authentication for schema.url
I meant stream.url.

Regards, Jayendra

On Tue, Nov 16, 2010 at 5:37 PM, Jayendra Patil jayendra.patil@gmail.com wrote: (quoting the message above in full)
Re: Multiple Word Facets
The ShingleFilter breaks the words in a sentence into combinations of 2/3 words. For the faceting field you should use:

    <field name="facet_field" type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be "string" so that it is not tokenised at all.

On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada estrada.a...@gmail.com wrote: Thanks guys, the solr.ShingleFilterFactory did work to get me multiple terms per facet, but now I am seeing some redundancy in the facet numbers. See below...

    Highway (62)
    Highway System (59)
    National (59)
    National Highway (59)
    National Highway System (59)
    System (59)

See what's going on here? How can I make my multi-token facets smarter so that the tokens aren't duplicated? Thanks in advance, Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan iori...@yahoo.com wrote: Facets are generated from indexed terms. Depending on your need/use-case: You can use an additional separate String field (which is not tokenized) for facets, populated via copyField. Search on the tokenized field, facet on the non-tokenized field. Or you can add solr.ShingleFilterFactory to your index analyzer to form multiple-word terms.

--- On Wed, 10/27/10, Adam Estrada estrada.a...@gmail.com wrote: From: Adam Estrada estrada.a...@gmail.com Subject: Multiple Word Facets To: solr-user@lucene.apache.org Date: Wednesday, October 27, 2010, 4:43 AM

All, I am new to Solr faceting and stuck on how to get multiple-word facets returned from a standard Solr query. See below for what is currently being returned.

    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="title">
          <int name="Federal">89</int>
          <int name="EFLHD">87</int>
          <int name="Eastern">87</int>
          <int name="Lands">87</int>
          <int name="Highways">84</int>
          <int name="FHWA">60</int>
          <int name="Transportation">32</int>
          <int name="GIS">22</int>
          <int name="Planning">19</int>
          <int name="Asset">15</int>
          <int name="Environment">15</int>
          <int name="Management">14</int>
          <int name="Realty">12</int>
          <int name="Highway">11</int>
          <int name="HEP">10</int>
          <int name="Program">9</int>
          <int name="HEPGIS">7</int>
          <int name="Resources">7</int>
          <int name="Roads">7</int>
          <int name="EEI">6</int>
          <int name="Environmental">6</int>
          <int name="Right">6</int>
          <int name="Way">6</int>
          ...etc...

There are many terms in there that are 2 or 3 word phrases. For example, "Eastern Federal Lands Highway Division" all gets broken down into the individual words that make up the total group of words. I've seen quite a few websites that do what it is I am trying to do here, so any suggestions at this point would be great. See my schema below (copied from the example schema).

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Similar for type="query". Please advise on how to group or cluster document terms so that they can be used as facets. Many thanks in advance, Adam Estrada
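Reading the string facet back from a client is then straightforward. A minimal SolrJ sketch, assuming the facet_field above is populated (via copyField or otherwise):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("facet_field"); // the untokenized string copy
            q.setFacetMinCount(1);
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("facet_field").getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
        }
    }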
Re: after the slave node pull index from master, when will solr del the tmp index dir
We faced the same issue. If you are executing a complete clean build, the slave copies the complete index and just switches the pointer in index.properties to point to the new index directory, leaving behind the old copies. And it does not clean them up.

I had logged a JIRA issue with a patch to the SnapPuller class; you may want to give it a try - https://issues.apache.org/jira/browse/SOLR-2156

Regards, Jayendra

2010/10/26 Chengyang atreey...@163.com: I noticed that the slave node has some tmp index.x dirs that were created during the index sync with the master, but they are not removed even after several days. So when will solr delete the tmp index dirs?
Re: Solr ExtractingRequestHandler with Compressed files
There was this issue with the previous version of Solr, wherein only the file names from the zip used to get indexed. We had faced the same issue and ended up using the Solr trunk, which has the Tika version upgraded and works fine. Solr version 1.4.1 should also have the fix included. Try using it.

Regards, Jayendra

On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel phan...@nearinfinity.com wrote: Hi, Has anyone had success using ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc)? I am sending solr the archived.tar file using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary @/home/archived.tar

The result I get when I query the document is that the filenames inside the archive are indexed as the body_texts, but the content of those files is not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example When I send one of the actual documents inside the archive using the same curl command, the extracted content is then stored in the body_texts field. Am I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF and HTML documents. I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4. Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.
Re: Solr sorting problem
Need additional information. Sorting is easy in Solr, just by passing the sort parameter. However, when it comes to text sorting, it depends on how you analyse and tokenize your fields; sorting does not work on fields with multiple tokens. http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan moazz...@gmail.com wrote: Hey guys, I have a list of people indexed in Solr. I am trying to sort by their first names, but I keep getting results that are not alphabetically sorted (I see the names starting with W before the names starting with A). I have a feeling that the results are first being sorted by relevancy and then sorted by first name. Is there a way I can get the results to be sorted alphabetically? Thanks, Moazzam
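The usual fix is to copyField the name into an untokenized string field and sort on that. A minimal SolrJ sketch, where first_name_sort is a hypothetical single-token string copy of the first name:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SortExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            // sort on the single-token copy, not the tokenized text field
            q.addSortField("first_name_sort", SolrQuery.ORDER.asc);
            System.out.println(solr.query(q).getResults());
        }
    }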
Re: /update/extract
The extract request handler invokes the classes from the extraction package: https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java This is packaged into the apache-solr-cell jar.

Regards, Jayendra

On Thu, Aug 19, 2010 at 10:04 AM, satya swaroop sswaro...@gmail.com wrote: Hi all, when we use the extract request handler, what class gets invoked? I need to know the navigation of classes when we send any files to solr. Can anybody tell me the classes, or any sources where I can get the answer? Or can anyone tell me what classes get invoked when we start solr? I would be thankful if anybody can help me with this. Regards, satya
Re: How to compile nightly build?
Yup. The nightly build you pointed out has pre-built code and does not include the lucene and module sources needed for compilation. In case you want to compile from the source, you can check out the code from https://svn.apache.org/repos/asf/lucene/dev/trunk

There are 3 folders - solr, lucene and modules. If you are making changes in any of the folders:

From the modules folder execute - ant compile
From the lucene folder execute - ant dist
From the solr folder execute - ant dist

This would require JDK 1.6.

Regards, Jayendra

On Fri, Aug 13, 2010 at 7:11 PM, Chris Hostetter hossman_luc...@fucit.org wrote: The nightly test artifacts don't currently contain everything needed to recompile the sources; this is a known issue: https://issues.apache.org/jira/browse/SOLR-1989 ...if you want to compile from source off the trunk or 3x branch, you need to check out the *entire* branch (not just the solr directory, but its parent, including lucene and the modules). This is the problem with the source in the nightly artifacts at the moment: they only include the solr source. -Hoss
Re: diacritics on query string
ASCIIFoldingFilter is probably the filter known to replace the accented chars with normal ones. However, I don't see it in your config. You can easily debug the issue through the solr analysis tool.

Regards, Jayendra

On Fri, Aug 13, 2010 at 3:20 AM, Andrea Gazzarini andrea.gazzar...@atcult.it wrote: Hi, I have a problem regarding a diacritic character in my query string: q=intertestualità, which is encoded as q=intertestualit%E0. What I'm not understanding is the following query response fragments:

    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">23</int>
      <lst name="params">
        <str name="sort">score desc</str>
        <str name="fl">score,title</str>
        <str name="debugQuery">on</str>
        <str name="indent">on</str>
        <str name="start">0</str>
        <str name="q">intertestualit</str>
        <str name="version">2.2</str>
        <str name="rows">3</str>
      </lst>

and

    <lst name="debug">
      <str name="rawquerystring">intertestualit</str>
      <str name="querystring">intertestualit</str>

I saw that my index contains the token "intertestualita" (with the 'à' char replaced with 'a'). Indeed, if I query for "intertestualita" I find my results. The queried field is configured with the same chain:

    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

So my question is: who is removing the à (%E0) character from the input query? It seems that the query arrives at SOLR already without that character...

Regards, Andrea
Re: Hierarchical faceting
We were able to get hierarchical faceting working with a workaround approach, e.g. if you have Europe//Norway//Oslo as an entry:

1. Create a new multivalued field with string type:

    <field name="country_facet" type="string" indexed="true" stored="true" multiValued="true"/>

2. Index the field for Europe//Norway//Oslo with the values:

    0//Europe
    1//Europe//Norway
    2//Europe//Norway//Oslo

3. The facet can now be used in queries:

1st level - would return all entries @ the 1st level, e.g. 0//USA, 0//Europe:
f.country_facet.facet.prefix=0//&facet.field=country_facet

2nd level - would return all entries @ the second level in Europe, e.g. 1//Europe//Norway, 1//Europe//Sweden:
fq=country_facet:0//Europe&f.country_facet.facet.prefix=1//Europe&facet.field=country_facet

3rd level - would return the entries under 1//Europe//Norway, e.g. 2//Europe//Norway//Oslo:
fq=country_facet:1//Europe//Norway&f.country_facet.facet.prefix=2//Europe//Norway&facet.field=country_facet

Increment the facet.prefix by 1 so that you limit the facet results to that prefix. This also works for any depth.

Regards, Jayendra

On Thu, Aug 12, 2010 at 6:01 PM, Mats Bolstad mat...@stud.ntnu.no wrote: Hey all, I am doing a search on hierarchical data, and I have a hard time getting my head around the following problem. I want a result as follows, in one single query only:

    USA (3)
      California (2)
      Arizona (1)
    Europe (4)
      Norway (3)
        Oslo (3)
      Sweden (1)

How it looks in the XML/JSON response is not really important, this is more a presentation issue. I guess I could store the values "USA", "USA/California", "Europe/Norway/Oslo" as strings for each document, and do some JavaScript-ing to show the hierarchies appropriately. When a specific item in the facet is selected, for example Norway, Solr could be queried with a filter query on Europe/Norway*? Does anyone have some experiences they could please share with me? I have tried out SOLR-64, and it gives me the results I look for. However, I do not have the opportunity to use a patch in the production environment ...

-- Thanks, Mats Bolstad
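For reference, the 3rd-level query above translates to SolrJ like this (a sketch, drilling from Europe//Norway down one level; the server URL is assumed):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HierarchyFacet {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("country_facet:\"1//Europe//Norway\""); // restrict to docs under Norway
            q.setFacet(true);
            q.addFacetField("country_facet");
            q.set("f.country_facet.facet.prefix", "2//Europe//Norway"); // only the next level down
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getFacetField("country_facet").getValues());
        }
    }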
Re: edismax pf2 and ps
We pretty much had the same issue and ended up customizing the ExtendedDismax code. In your case it's just a change of a single line, from:

    addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, pslop);

to:

    addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, 0);

Regards, Jayendra

On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote: Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter?

I find the 'pf' parameter with a pretty large 'ps' does a very nice job of providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects?

Edismax's pf2 parameter is really nice for boosting exact phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
Re: PDF file
Try:

    curl "http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=Full_Path_of_File/pub2009001.pdf&literal.id=777045&commit=true"

stream.file - specify the full path
literal.* params - specify any extra params if needed

Regards, Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote: Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is:

    curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@pub2009001.pdf

Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file?

    [xiao...@lhcinternal lhc]$ curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@pub2009001.pdf
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Error 404 </title>
    </head>
    <body><h2>HTTP ERROR: 404</h2><pre>NOT_FOUND</pre>
    <p>RequestURI=/solr/lhc/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
    </body>
    </html>

***

-Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file

Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server (paths inside the solr 1.4.1 download): /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon

From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov] Sent: Tuesday, August 10, 2010 11:57 AM To: 'solr-user@lucene.apache.org' Subject: RE: PDF file

Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance.

-Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file

I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. In the Solr 1.4 Enterprise Search Server book, the following command is used to import a mccm.pdf:

    curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@mccm.pdf

Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
Re: Setting up apache solr in eclipse with Tomcat
Have got solr working in Eclipse and deployed on Tomcat through the Eclipse plugin. The crude approach was to:

1. Import the Solr war into Eclipse, which will be imported as a web project and can be deployed on tomcat.
2. Add multiple source folders to the project, linked to the checked-out solr source code, e.g. an entry in the .project file:

    <linkedResources>
      <link>
        <name>common</name>
        <type>2</type>
        <location>D:/Solr/solr/src/common</location>
      </link>
      ...
    </linkedResources>

3. Remove the solr jars from the WEB-INF lib, so that changes to the project sources can be deployed and debugged.

Let me know if you find a better approach.

On Wed, Aug 4, 2010 at 3:49 AM, Hando420 hando...@gmail.com wrote: I would like to setup apache solr in eclipse using tomcat. It is easy to setup with jetty but with tomcat it doesn't run solr at runtime. Has anyone done this before? Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Setting-up-apache-solr-in-eclipse-with-Tomcat-tp1021673p1021673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Setting up apache solr in eclipse with Tomcat
The solr home is configured in the web.xml of the application, which points to the folder holding the conf files and the data directory:

    <env-entry>
      <env-entry-name>solr/home</env-entry-name>
      <env-entry-value>D:/multicore</env-entry-value>
      <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>

Regards, Jayendra

On Wed, Aug 4, 2010 at 12:21 PM, Hando420 hando...@gmail.com wrote: Thanks man, I haven't tried this, but where do I put that xml configuration? Is it in the web.xml in solr? Cheers, Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Setting-up-apache-solr-in-eclipse-with-Tomcat-tp1021673p1023188.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solrj ContentStreamUpdateRequest Slow
ContentStreamUpdateRequest seems to read the file contents and transfer them over http, which slows down the indexing. Try using StreamingUpdateSolrServer with the stream.file param @ http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post e.g.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    SolrServer server = new StreamingUpdateSolrServer(solrServerUrl, 20, 8);
    UpdateRequest req = new UpdateRequest("/update/extract");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.add("stream.file", localFilePath); // the file is read by the Solr server, not streamed over http
    params.set("literal.id", value);
    req.setParams(params);
    server.request(req);
    server.commit();

Regards, Jayendra

On Wed, Aug 4, 2010 at 3:01 PM, Tod listac...@gmail.com wrote: I'm running a slight variation of the example code referenced below and it takes a really long time to finally execute. In fact it hangs for a long time at solr.request(up) before finally executing. Is there anything I can look at or tweak to improve performance? I am also indexing a local pdf file, there are no firewall issues, solr is running on the same machine, and I tried the actual host name in addition to localhost, but nothing helps. Thanks - Tod http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
Re: query about qf defaults
You can use "appends" for any additional fq parameters, which would be appended to the ones passed at query time. Check out the sample solrconfig.xml shipped with solr:

    <!-- In addition to defaults, "appends" params can be specified to identify
         values which should be appended to the list of multi-val params from
         the query (or the existing "defaults"). In this example, the param
         "fq=instock:true" will be appended to any query time fq params the
         user may specify, as a mechanism for partitioning the index,
         independent of any user selected filtering that may also be desired
         (perhaps as a result of faceted searching).
         NOTE: there is *absolutely* nothing a client can do to prevent these
         "appends" values from being used, so don't use this mechanism unless
         you are sure you always want it. -->
    <lst name="appends">
      <str name="fq">inStock:true</str>
    </lst>

Regards, Jayendra

On Tue, Aug 3, 2010 at 8:25 AM, Robert Neve robert.n...@gmx.co.uk wrote: Hi, I have in my solr config file the code below to create a default for fq, which works great. The problem I have is that if I want to use a custom fq, this one gets overwritten. Is there a way I can have it keep this fq alongside other custom ones? Basically this field sets whether the person is to show up or not, so it's important that anyone set to "d" is never shown regardless of any other query filters.

    <lst name="defaults">
      <str name="fq">ss_cck_field_status:d </str>

thanks in advance for any help Robert
QueryUtils API Change - Custom ExtendedDismaxQParserPlugin accessing QueryUtils.makeQueryable throws java.lang.IllegalAccessError
We have a custom implementation of ExtendedDismaxQParserPlugin, which we bundle into a jar and expose in the multicore shared lib. The custom ExtendedDismaxQParserPlugin implementation still uses the QueryUtils makeQueryable method, the same as the stock ExtendedDismaxQParserPlugin implementation. However, the method call throws a java.lang.IllegalAccessError, as it is being called from the inner ExtendedSolrQueryParser class and makeQueryable has no access modifier (basically package-private default).

Any reason for having it with the default access modifier, or any plans to make it public?

Regards, Jayendra
Document Boost with Solr Extraction - SolrContentHandler
We are using the Solr extract handler (/update/extract) for indexing document metadata with attachments. However, the SolrContentHandler doesn't seem to support the index-time document boost attribute. Probably document.setDocumentBoost(Float.parseFloat(boost)) is missing.

Regards, Jayendra