Use a different folder for schema.xml

2012-08-22 Thread Alexander Cougarman
Hi. For our Solr instance, we need to put the schema.xml file in a different 
location than where it resides now. Is this possible? Thanks.

Sincerely,
Alex 



Re: Co-existing solr cloud installations

2012-08-22 Thread Lance Norskog
ZK has a 'chroot' feature (named after the Unix multi-tenancy feature).

http://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#ch_zkSessions
https://issues.apache.org/jira/browse/ZOOKEEPER-237

The last I heard, this feature could work for making a single ZK
cluster support multiple SolrCloud clusters. Has it been proven out?
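(A minimal sketch of how the chroot is used in practice; the host names and
the /solr4 path are illustrative, and the classpath placeholder needs to be
filled in with the Solr war's libs. The path has to exist before Solr
connects, which is what Mark's ZkCli suggestion below covers:

  # create the chroot node once, using the ZkCli tool that ships with Solr
  java -classpath <solr-war-libs> org.apache.solr.cloud.ZkCLI \
       -zkhost zk1:2181 -cmd makepath /solr4

  # then point each SolrCloud cluster at its own chroot
  -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr4
)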

On Tue, Aug 21, 2012 at 8:22 PM, Mark Miller markrmil...@gmail.com wrote:
 You can use a connect string of host:port/path to 'chroot' a path. I
 think currently you have to manually create the path first though. See
 the ZkCli tool (doc'd on SolrCloud wiki) for a simple way to do that.

 I keep meaning to look into auto making it if it doesn't exist, but
 have not gotten to it.

 - Mark

 On Tue, Aug 21, 2012 at 4:46 PM, Buttler, David buttl...@llnl.gov wrote:
 Hi all,
 I would like to use a single zookeeper cluster to manage multiple Solr cloud 
 installations.  However, the current design of how Solr uses zookeeper seems 
 to preclude that.  Have I missed a configuration option to set a zookeeper 
 prefix for all of a Solr cloud's configuration directories?

 If I look at the zookeeper data it looks like:

  * /clusterstate.json
  * /collections
  * /configs
  * /live_nodes
  * /overseer
  * /overseer_elect
  * /zookeeper
 Is there a reason not to put all of these nodes under some user-configurable 
 higher-level node, such as /solr4?
 It could have a reasonable default value to make it just as easy to find as 
 /.

 My current issue is that I have an old Solr cloud instance from back in the 
 Solr 1.5 days, and I don't expect that the new version and the old version 
 will play nice.

 Thanks,
 Dave




-- 
Lance Norskog
goks...@gmail.com


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Lance Norskog
How do you separate the documents among the shards? Can you set up the
shards such that one collapse group is only on a single shard, so that
you never have to do distributed grouping?

On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 This won't work; see my thread on Solr 3.6 field collapsing.
 Thanks,
 Tirthankar

 -Original Message-
 From: Tom Burton-West tburt...@umich.edu
 Date: Tue, 21 Aug 2012 18:39:25
 To: solr-user@lucene.apache.orgsolr-user@lucene.apache.org
 Reply-To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: William Dueberdueb...@umich.edu; Phillip Farberpfar...@umich.edu
 Subject: Scalability of Solr Result Grouping/Field Collapsing:
  Millions/Billions of documents?

 Hello all,

 We are thinking about using Solr Field Collapsing on a rather large scale
 and wonder if anyone has experience with performance when doing Field
 Collapsing on millions or billions of documents (details below).  Are
 there performance issues with grouping large result sets?

 Details:
 We have a collection of the full text of 10 million books/journals.  This
 is spread across 12 shards with each shard holding about 800,000
 documents.  When a query matches a journal article, we would like to group
 all the matching articles from the same journal together. (there is a
 unique id field identifying the journal).  Similarly when there is a match
 in multiple copies of the same book we would like to group all results for
 the same book together (again we have a unique id field we can group on).
 Sometimes a short query against the OCR field will result in over one
 million hits.  Are there known performance issues when field collapsing
 result sets containing a million hits?

 We currently index the entire book as one Solr document.  We would like to
 investigate the feasibility of indexing each page as a Solr document with a
 field indicating the book id.  We could then offer our users the choice of
 a list of the most relevant pages, or a list of the books containing the
 most relevant pages.  We have approximately 3 billion pages.   Does anyone
 have experience using field collapsing on this sort of scale?

 Tom

 Tom Burton-West
 Information Retrieval Programmer
 Digital Library Production Service
 University of Michigan Library
 http://www.hathitrust.org/blogs/large-scale-search
 **Legal Disclaimer***
 This communication may contain confidential and privileged
 material for the sole use of the intended recipient. Any
 unauthorized review, use or distribution by others is strictly
 prohibited. If you have received the message in error, please
 advise the sender by reply email and delete the message. Thank
 you.
 *



-- 
Lance Norskog
goks...@gmail.com


Re: Does DIH commit during large import?

2012-08-22 Thread Lance Norskog
Solr has a separate feature called 'autoCommit'. This is configured in
solrconfig.xml. You can set Solr to commit all documents every N
milliseconds or every N documents, whichever comes first. If you want
intermediate commits during a long DIH session, you have to use this
or make your own script that does commits.
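For example, a minimal autoCommit sketch for solrconfig.xml (the numbers are
illustrative, not recommendations):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many added docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>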

On Tue, Aug 21, 2012 at 8:48 AM, Shawn Heisey s...@elyograg.org wrote:
 On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:

 I am doing an import of large records (with large full-text fields)
 and somewhere around 30 records DataImportHandler runs out of
 memory (Heap) on a TIKA import (triggered from custom Processor) and
 does roll-back. I am using store=false and trying some tricks and
 tracking possible memory leaks, but also have a question about DIH
 itself.

 What actually happens when I run DIH on a large (XML Source) job? Does
 it accumulate some sort of status in memory that it commits at the
 end? If so, can I do intermediate commits to drop the memory
 requirements? Or, will it help to do several passes over the same
 dataset and import only particular entries at a time? I am using the
 Solr 4 (alpha) UI, so I can see some of the options there.


 I use Solr 3.5 and a MySQL database for import, so my setup may not be
 completely relevant, but here is my experience.

 Unless you turn on autocommit in solrconfig, documents will not be
 searchable during the import.  If you have commit=true for DIH (which I
 believe is the default), there will be a commit at the end of the import.

 It looks like there's an out of memory issue filed on Solr 4 DIH with Tika
 that is suspected to be a bug in Tika rather than Solr.  The issue details
 talk about some workarounds for those who are familiar with Tika -- I'm not.
 The issue URL:

 https://issues.apache.org/jira/browse/SOLR-2886

 Thanks,
 Shawn




-- 
Lance Norskog
goks...@gmail.com


Re: How to design index for related versioned database records

2012-08-22 Thread Lance Norskog
Another option is to take the minimum time interval and record every
interval during which an employee record is active. Make a compound key
of the employee and the time range. (Look at the SignatureUpdateProcessor
for how to do this.) Add one multi-valued field that contains all of the
time intervals for which this record is active.

If you make this multi-valued field indexed and not stored, the index
will store one copy of each interval for all of the documents, and
with each interval a list of all documents containing it. This takes a
surprisingly small amount of memory. You do not have to do range
searches or joins, you can just do an OR of all of the intervals you
are looking for.
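A rough sketch of the two pieces this describes, in schema.xml and
solrconfig.xml terms (the field names, chain name, and signature class are
illustrative, following the stock deduplication setup):

<!-- schema.xml: indexed-only intervals, one value per active interval -->
<field name="active_intervals" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- solrconfig.xml: compound key built from employee id plus time range -->
<updateRequestProcessorChain name="signature">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <str name="fields">employee_id,interval_start,interval_end</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>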

On Tue, Aug 21, 2012 at 5:20 AM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, how many employees/services/dates are we talking about
 here? Is the cross product 1M? 1B? 1G records?

 You could try the Solr join stuff (Solr 4x), be aware that it performs
 best on join fields with a limited number of unique values.

 Best
 Erick

 On Tue, Aug 21, 2012 at 4:05 AM, Stefan Burkard sburk...@gmail.com wrote:
 Hi Jack

 Thanks for your answer. Do I understand correctly that I must
 create a merge entity that contains all the different
 validFrom/validUntil dates as fields (and of course the other
 search-related fields)?

 This would mean that the number of index entries is equal to the
 number of all possible combinations of from/until date-ranges in a
 record-chain (all records with all their individual versions
 connected by foreign keys) since every combination creates a new
 record in a query across all tables. That also means that I will have
 a lot of entries with the same values in the other search-related
 fields - the only difference will be most of the time one of the
 from/until-ranges.
 Perhaps the query can be optimized so that irrelevant combinations can
 be avoided (for example if two date-ranges do not overlap).

 Then, when I have built that index, I can query it with the reference
 date as an argument, to compare it with every from/until range in the
 chain, and so get only the relevant entries where the reference date
 lies within all from/until ranges.

 Is this correct?

 Thanks and regards
 Stephan



 On Wed, Aug 15, 2012 at 2:32 PM, Jack Krupansky j...@basetechnology.com 
 wrote:
 The date checking can be implemented using range query as a filter query,
 such as

 fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

 (You can also use an frange query.)
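(A sketch of the frange form of the same two conditions, assuming the ms()
date function and the same startDate/endDate fields; ms(a,b) is the
millisecond difference a minus b:

  fq={!frange u=0}ms(startDate,NOW)   startDate is in the past
  fq={!frange l=0}ms(endDate,NOW)     endDate is in the future
)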

 Then you will have to flatten the database tables. Your Solr schema would
 have a single merged record type. You will have to decide whether the
 different record types (tables) will have common fields versus static
 qualification by adding a prefix or suffix, e.g., name vs. employee_name
 and employer_name. The latter has the advantage that you do not have to
 separately specify a table type field since the fields would be empty for
 records of other types.

 -- Jack Krupansky

 -Original Message- From: Stefan Burkard
 Sent: Wednesday, August 15, 2012 8:12 AM
 To: solr-user@lucene.apache.org
 Subject: How to design index for related versioned database records


 Hi solr-users

 I have a case where I need to build an index from a database.

 ***Data structure***
 The data is spread across multiple tables and in each table the
 records are versioned - this means that one real record can exist
 multiple times in a table, each with different validFrom/validUntil
 dates. Therefore it is possible to query the valid version of a record
 for a given point in time.

 The relations of the data are something like this:
 Employee - LinkTable (=Employment) - Employer - LinkTable
 (=offered services) - Service

 That means I have data across 5 relations, each of them with versioned
 records.

 ***Search needs***
 Now I need to be able to search for employees and employers based on
 the services they offer for a given point in time.

 Therefore I have built an index of all employees and employers with
 their services as subentity. So I have one index entry for every
 version of every employee/employer and each version collects the
 offered services for the given timeframe of the employee/employer
 version.

 Problem: The offered services of an employee/employer can change
 during its validity period. That means I do not only need to take the
 version timespan of the employee/employer into account but also the
 version timespans of services and the link-tables.

 ***Question***
 I think I could continue with my strategy to have an index entry of an
 employee/employer with its services for any given point in time. But
 there are much more entries than now since every involved
 validfrom/validuntil period (if they overlap) produces more entries.
 But I am not sure if this is a good strategy, or if it would be better
 to try to index the whole datastructure in an other way.

 Are there any recommendations how to handle such a case?

 Thanks for any help
 Stephan



-- 
Lance Norskog
goks...@gmail.com

Re: Solr search – Tika extracted text from PDF not return highlighting snippet

2012-08-22 Thread Lance Norskog
There is no copyField in the schema. You have to put the parsed
text into a field that is stored: highlighting works on stored fields.
There is no text field in the schema either; I don't know how the DIH
automatically creates it.
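For example, a schema.xml sketch (the field and type names, and the source
field holding the parsed text, are illustrative):

<field name="text" type="text_general" indexed="true" stored="true"
       multiValued="true"/>
<copyField source="content" dest="text"/>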

On Tue, Aug 21, 2012 at 2:10 PM, anarchos78
rigasathanasio...@hotmail.com wrote:
 Any help? Anyone?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-search-Tika-extracted-text-from-PDF-not-return-highlighting-snippet-tp3999647p4002513.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: Use a different folder for schema.xml

2012-08-22 Thread Lance Norskog
It is possible to store the entire conf/ directory somewhere. To store
only the schema.xml file, try soft links or the XML include feature:
conf/schema.xml includes from somewhere else.
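A soft link is just, e.g., ln -s /shared/conf/schema.xml conf/schema.xml on
Unix. The XML include route would look roughly like this (a sketch; the href
is illustrative, and whether XInclude is honored depends on how the XML
parser is configured):

<schema name="example" version="1.5">
  <xi:include href="file:///shared/conf/schema-fields.xml"
              xmlns:xi="http://www.w3.org/2001/XInclude"/>
  ...
</schema>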

On Tue, Aug 21, 2012 at 11:31 PM, Alexander Cougarman acoug...@bwc.org wrote:
 Hi. For our Solr instance, we need to put the schema.xml file in a different 
 location than where it resides now. Is this possible? Thanks.

 Sincerely,
 Alex




-- 
Lance Norskog
goks...@gmail.com


Which directories are required in Solr?

2012-08-22 Thread Alexander Cougarman
Hi. Which folders/files can be deleted from the default Solr package 
(apache-solr-3.6.1.zip) on Windows if all we'd like to do is index/store 
documents? Thanks.

Sincerely,
Alex 



Highlighting is case sensitive when search with double quote

2012-08-22 Thread vrpar...@gmail.com
When I search with "abc cde", Solr returns results, but the highlighting
portion is as below:

<lst name="highlighting">
<lst name="1">
</lst>
</lst>

and when I search with "ABC cde" it has the response below:

<lst name="highlighting">
<lst name="1">
<arr name="SearchField">
<str>
... ... <em>ABC cde</em> ...
</str>
</arr>
</lst>
</lst>

It seems the highlighting response is case-sensitive.

In both of the above cases the other query parameters are the same.

How can I get a case-insensitive response?

Thanks,
Vishal Parekh





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-is-case-sensitive-when-search-with-double-quote-tp4002576.html
Sent from the Solr - User mailing list archive at Nabble.com.


display SOLR Query in web page

2012-08-22 Thread Bernd Fehling
Now this is very scary: while searching for solr direct access per docid I
got a hit from the US Homeland Security Digital Library. Interested in what
they had to tell me about my search, I clicked on the link to the page. At
first the page had nothing unusual about it, but why did I get the hit?
http://www.hsdl.org/?collection/stratpolid=4

Inspecting the page source view shows that they have the solr query
displayed directly on their page in a span with style="display: none".
-- snippet --
<!-- Search Results -->

<span style="display: none;">*** SOLR Query *** &mdash; q=Collection:0 AND
(TabSection:("Congressional hearings and testimony", "Congressional
reports", "Congressional resolutions", "Directives (presidential)",
"Executive orders", "Major Legislation", "Public laws", "Reports (CBO)",
"Reports (CHDS)", "Reports (CRS)",...
...
AND (Title_nostem:("China Forces Senior Intelligence Officer")^10
AlternateTitle_nostem:("China Forces Senior Intelligence
Officer")^9)&sort=score
desc&rows=30&start=0&indent=off&facet=on&facet.limit=1&facet.mincount=1&fl=AlternateTitle_text,Collection,CoverageCountry,CoverageState,Creator_nostem,DateLastModified,DateOfRecordEntry,Description_text,DisplayDate,DocID,ExternalDocId,ExternalDocSource,FileDate,FileExtension,FileSize,FileTitle_text,Format,Language,PublishDate,Publisher_text,Publisher_nostem,ReportNumber,ResourceType,RetrievedFrom,Rights,Subjects,Source,TabSection,Title_text,URL_text,Alternate_URL_text,CreatedBy,ModifiedBy,Notes&wt=phps&facet.field=Creator&facet.field=Format&facet.field=Language&facet.field=Publisher&facet.field=TabSection</span>
-- snippet --

As you can see, I had searched for "China Forces Senior Intelligence
Officer", so this is directly showing the query string.
Do they know that there is also a delete-by-query?
And that there are also escape sequences?

This is what I call scary.
Maybe some of the US fellows can give them a hint and a helping hand.

Regards
Bernd


RE: Use a different folder for schema.xml

2012-08-22 Thread Alexander Cougarman
Thanks, Lance. Please forgive my ignorance, but what do you mean by soft 
links/XML include feature? Can you provide an example? Thanks again.

Sincerely,
Alex 

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: 22 August 2012 9:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Use a different folder for schema.xml

It is possible to store the entire conf/ directory somewhere. To store only the 
schema.xml file, try soft links or the XML include feature:
conf/schema.xml includes from somewhere else.

On Tue, Aug 21, 2012 at 11:31 PM, Alexander Cougarman acoug...@bwc.org wrote:
 Hi. For our Solr instance, we need to put the schema.xml file in a different 
 location than where it resides now. Is this possible? Thanks.

 Sincerely,
 Alex




--
Lance Norskog
goks...@gmail.com


Re: Use a different folder for schema.xml

2012-08-22 Thread Ravish Bhagdev
You can include one XML file in another, something like:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE document [ <!ENTITY resourcedb SYSTEM
'file:/some/absolute/path/a.xml'> ]>
<resource>
<childofb>&resourcedb;</childofb>
</resource>


- Ravish

On Wed, Aug 22, 2012 at 8:56 AM, Alexander Cougarman acoug...@bwc.orgwrote:

 Thanks, Lance. Please forgive my ignorance, but what do you mean by soft
 links/XML include feature? Can you provide an example? Thanks again.

 Sincerely,
 Alex

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: 22 August 2012 9:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Use a different folder for schema.xml

 It is possible to store the entire conf/ directory somewhere. To store
 only the schema.xml file, try soft links or the XML include feature:
 conf/schema.xml includes from somewhere else.

 On Tue, Aug 21, 2012 at 11:31 PM, Alexander Cougarman acoug...@bwc.org
 wrote:
  Hi. For our Solr instance, we need to put the schema.xml file in a
 different location than where it resides now. Is this possible? Thanks.
 
  Sincerely,
  Alex
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Solr search – Tika extracted text from PDF not return highlighting snippet

2012-08-22 Thread anarchos78
Thanks for your reply,
I have tried many things (copyField etc.) with no success. Note that the
PDFs are stored as BLOBs in a MySQL database. I am trying to use DIH in
order to fetch the binaries from the DB. Is it possible?
Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-search-Tika-extracted-text-from-PDF-not-return-highlighting-snippet-tp3999647p4002587.html
Sent from the Solr - User mailing list archive at Nabble.com.


Weighted Search Results / Multi-Value Value's Not Aggregating Weight

2012-08-22 Thread David Radunz

Hey,

I have been having some problems getting good search results when
using weighting against many fields with multi-values. After quite a bit
of testing, it seems to me that the problem (at least as far as my
query is concerned) is that only one weighting is taken into account
per field. For example, in a multi-value field, if we have "Comedy" and
"Romance" and have separate weightings for those, the one with the
highest weighting is used (and not a combined weighting). This means
that a search for "romantic comedy" returns "Alvin and the Chipmunks"
(Family, Children, Comedy).


Query:

facet=on&fl=id,name,matching_genres,score,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date&sort=score+desc,retail_rating+desc,release_date+desc&start=&q=**+-sku:1019660+-movie_id:1805+-movie_id:1806+(series_names_attr_opt_id:454282^9000+OR+cat_id:22^9+OR+cat_id:248^9+OR+cat_id:249^9+OR+matching_genres:Comedy^9+OR+matching_genres:Romance^7+OR+matching_genres:Drama^5)&fq=store_id:1+AND+avail_status_attr_opt_id:available+AND+(format_attr_opt_id:372619)&rows=4

Now if I change matching_genres:Romance^7 to
matching_genres:Romance^70 (adding a 0), suddenly the first result
is "Sex and the City: The Movie / Sex and the City 2" (which, ironically,
is Drama, Comedy, Romance: the very combination we are looking for).


So is there a way to structure my query so that all of the
multi-value values are treated individually, aggregating the
weighting/score?


Thanks in advance!

David



Solr - case-insensitive search do not work

2012-08-22 Thread meghana
I want to apply case-insensitive search to the field *myfield* in Solr.

I googled a bit and found that I need to apply *LowerCaseFilterFactory* to
the field type, and that the field should be of solr.TextField.

I applied that in my *schema.xml* and re-indexed the data, but my search
still seems to be case-sensitive.

Below is the search that I perform:
*http://localhost:8080/solr/select?q=myfield:"cloud university"&hl=on&hl.snippets=99&hl.fl=myfield*

Below is the definition of the field type:

<fieldType name="text_en_splitting" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

and below is my field definition:

<field name="myfield" type="text_en_splitting" indexed="true" stored="true"
/>

Not sure what is wrong with this. Please help me to resolve this.

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605.html
Sent from the Solr - User mailing list archive at Nabble.com.
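For comparison with the type definition above, a minimal case-insensitive
type looks like this (a sketch with none of the extra filters, useful for
isolating whether the lowercasing itself is taking effect):

<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>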


Re: Solr - case-insensitive search do not work

2012-08-22 Thread Ravish Bhagdev
<filter class="solr.LowerCaseFilterFactory"/> is already present in your
field type definition (it appears twice now).

Are you adding quotes around your query by any chance?

Ravish

On Wed, Aug 22, 2012 at 11:31 AM, meghana meghana.rav...@amultek.comwrote:

 I want to apply case-insensitive search to the field *myfield* in Solr.

 I googled a bit and found that I need to apply *LowerCaseFilterFactory* to
 the field type, and that the field should be of solr.TextField.

 I applied that in my *schema.xml* and re-indexed the data, but my search
 still seems to be case-sensitive.

 Below is the search that I perform:
 *http://localhost:8080/solr/select?q=myfield:"cloud university"&hl=on&hl.snippets=99&hl.fl=myfield*

 Below is the definition of the field type:

 <fieldType name="text_en_splitting" class="solr.TextField"
 positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords_en.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords_en.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>

 and below is my field definition:

 <field name="myfield" type="text_en_splitting" indexed="true" stored="true"
 />

 Not sure what is wrong with this. Please help me to resolve this.

 Thanks




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr - case-insensitive search do not work

2012-08-22 Thread meghana
@Ravish Bhagdev, yes, I am adding double quotes around my search, as shown
in my post. Like:

myfield:"cloud university"





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does DIH commit during large import?

2012-08-22 Thread Alexandre Rafalovitch
Thanks, I will look into autoCommit.

I assume there are memory implications of not committing? Or does it
just write to a separate file, so that it can theoretically go on
indefinitely?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Aug 22, 2012 at 2:42 AM, Lance Norskog goks...@gmail.com wrote:
 Solr has a separate feature called 'autoCommit'. This is configured in
 solrconfig.xml. You can set Solr to commit all documents every N
 milliseconds or every N documents, whichever comes first. If you want
 intermediate commits during a long DIH session, you have to use this
 or make your own script that does commits.

 On Tue, Aug 21, 2012 at 8:48 AM, Shawn Heisey s...@elyograg.org wrote:
 On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:

 I am doing an import of large records (with large full-text fields)
 and somewhere around 30 records DataImportHandler runs out of
 memory (Heap) on a TIKA import (triggered from custom Processor) and
 does roll-back. I am using store=false and trying some tricks and
 tracking possible memory leaks, but also have a question about DIH
 itself.

 What actually happens when I run DIH on a large (XML Source) job? Does
 it accumulate some sort of status in memory that it commits at the
 end? If so, can I do intermediate commits to drop the memory
 requirements? Or, will it help to do several passes over the same
 dataset and import only particular entries at a time? I am using the
 Solr 4 (alpha) UI, so I can see some of the options there.


 I use Solr 3.5 and a MySQL database for import, so my setup may not be
 completely relevant, but here is my experience.

 Unless you turn on autocommit in solrconfig, documents will not be
 searchable during the import.  If you have commit=true for DIH (which I
 believe is the default), there will be a commit at the end of the import.

 It looks like there's an out of memory issue filed on Solr 4 DIH with Tika
 that is suspected to be a bug in Tika rather than Solr.  The issue details
 talk about some workarounds for those who are familiar with Tika -- I'm not.
 The issue URL:

 https://issues.apache.org/jira/browse/SOLR-2886

 Thanks,
 Shawn




 --
 Lance Norskog
 goks...@gmail.com


Runtime.exec() not working on Tomcat

2012-08-22 Thread 122jxgcn
I have the following code in my Apache Tika Maven project.

This code works when I test locally, but fails when it's attached as an
external jar in Apache Solr (the container is Tomcat).

String cmd; contains the command string that will convert the file, with
input such as

./convert.bin input.custom output.xml

I checked that convert.bin and input.custom exist.


String cmd; // As explained above
File out = new File(dir_path, "output.xml"); // dir_path is the file path

Process ps = null;

try {
    ps = Runtime.getRuntime().exec(cmd); // execute command
    int exitVal = ps.waitFor();
    logger.info("Executing Runtime successful with exit value of " + exitVal);
    // exitVal is 0
} catch (Exception e) {
    logger.error("Exception in executing Runtime: " + e); // not reaching here
}

// I get "Out file does not exist", although I should get the proper output
if (out.exists()) logger.info("Out file exists");
else logger.info("Out file does not exist"); // reaches here

out.setWritable(true);
out.setReadable(true);
out.setExecutable(true);
out.deleteOnExit();

// I get FileNotFoundException here
InputStream xml_stream = new FileInputStream(out);


I'm really confused because I get the right result locally (Maven test), but
not when it is on Tomcat.

Any help please?
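One classic Runtime.exec() pitfall worth ruling out here (a sketch, not a
confirmed diagnosis): if the child process writes enough to stdout or
stderr, it can stall because nothing drains the pipes. Draining the output
before calling waitFor(), for example:

// assumes the ps and logger variables above, plus
// java.io.BufferedReader and java.io.InputStreamReader
BufferedReader r = new BufferedReader(new InputStreamReader(ps.getInputStream()));
String line;
while ((line = r.readLine()) != null) {
    logger.info("convert.bin: " + line); // surface the tool's own output
}
int exitVal = ps.waitFor(); // wait only after the stream is drained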



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Runtime-exec-not-working-on-Tomcat-tp4002614.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Runtime.exec() not working on Tomcat

2012-08-22 Thread Alexandre Rafalovitch
Could it be different 'current' working directories? What happens if
you hardcode the full path into the command and input/output files?

./convert.bin -> /Dev/Solr/bin/convert.bin, etc.

Also, you may want to use some file system observation tools to figure
out exactly what file is touched where. Look for dtrace on Unix-like
systems and for SysInternals ProcMon on Windows.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Aug 22, 2012 at 7:18 AM, 122jxgcn ywpar...@gmail.com wrote:
 I have following code on my Apache Tika Maven project.

 This code works when I test locally, but fails when it's attached as
 external jar in Apache Solr (container is Tomcat).

 String cmd; contains command string that will convert file with input as

 ./convert.bin input.custom output.xml


Re: Solr - case-insensitive search do not work

2012-08-22 Thread Ravish Bhagdev
OK.  Try without quotes like myfield:cloud+university and see if it has any
effect.

Also, try both queries with debugging turned on and post the output of the
same ( http://wiki.apache.org/solr/CommonQueryParameters#Debugging )
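For example (a sketch based on the URL from the earlier post):

http://localhost:8080/solr/select?q=myfield:"cloud university"&debugQuery=on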

It must be some field configuration issue or that double quotes are causing
some analyzers to not work on your query.

Hope this helps.

Ravish

On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.comwrote:

 @Ravish Bhagdev , Yes I am adding double quotes around my search , as shown
 in my post. Like,

 myfield:"cloud university"





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr - case-insensitive search do not work

2012-08-22 Thread Ravish Bhagdev
Also, try comparing your field configuration to Solr's default text field
and see if you can spot any differences.

Ravish

On Wed, Aug 22, 2012 at 1:09 PM, Ravish Bhagdev ravish.bhag...@gmail.comwrote:

 OK.  Try without quotes like myfield:cloud+university and see if it has
 any effect.

 Also, try both queries with debugging turned on and post the output of the
 same ( http://wiki.apache.org/solr/CommonQueryParameters#Debugging )

 It must be some field configuration issue or that double quotes are
 causing some analyzers to not work on your query.

 Hope this helps.

 Ravish

 On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.comwrote:

 @Ravish Bhagdev , Yes I am adding double quotes around my search , as
 shown
 in my post. Like,

 myfield:"cloud university"





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Edismax parser weird behavior

2012-08-22 Thread amitesh116
Hi, I am experiencing 2 strange behaviors in edismax:
edismax is configured to default to OR (using mm=0).
In total there are 700 results.
1. Search for *auto* = *50 results*
   Search for *NOT auto* = *651 results*.
Mathematically, it should give only 650 results for *NOT auto*.

2. Search for *auto* = *50 results*
   Search for *car* = *100 results*
   Search for *auto and car* = *10 results*
Since we have set mm=0, it should behave like OR, and the results for *auto
and car* should be at least 100.

Please help me understand these two issues. Are these normal behaviors? Do I
need to tweak the query? Or do I need to look into the config or schema XML
files?

Thanks in Advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Edismax-parser-weird-behavior-tp4002626.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tirthankar Chatterjee
You can collapse in each shard as a separate query

Lance Norskog goks...@gmail.com wrote:


How do you separate the documents among the shards? Can you set up the
shards such that one collapse group is only on a single shard? That
you never have to do distributed grouping?

On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 This won't work; see my thread on Solr 3.6 field collapsing.
 Thanks,
 Tirthankar

 -Original Message-
 From: Tom Burton-West tburt...@umich.edu
 Date: Tue, 21 Aug 2012 18:39:25
 To: solr-user@lucene.apache.orgsolr-user@lucene.apache.org
 Reply-To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Cc: William Dueberdueb...@umich.edu; Phillip Farberpfar...@umich.edu
 Subject: Scalability of Solr Result Grouping/Field Collapsing:
  Millions/Billions of documents?

 Hello all,

 We are thinking about using Solr Field Collapsing on a rather large scale
 and wonder if anyone has experience with performance when doing Field
 Collapsing on millions or billions of documents (details below).  Are
 there performance issues with grouping large result sets?

 Details:
 We have a collection of the full text of 10 million books/journals.  This
 is spread across 12 shards with each shard holding about 800,000
 documents.  When a query matches a journal article, we would like to group
 all the matching articles from the same journal together. (there is a
 unique id field identifying the journal).  Similarly when there is a match
 in multiple copies of the same book we would like to group all results for
 the same book together (again we have a unique id field we can group on).
 Sometimes a short query against the OCR field will result in over one
 million hits.  Are there known performance issues when field collapsing
 result sets containing a million hits?

 We currently index the entire book as one Solr document.  We would like to
 investigate the feasibility of indexing each page as a Solr document with a
 field indicating the book id.  We could then offer our users the choice of
 a list of the most relevant pages, or a list of the books containing the
 most relevant pages.  We have approximately 3 billion pages.   Does anyone
 have experience using field collapsing on this sort of scale?

 Tom

 Tom Burton-West
 Information Retrieval Programmer
 Digital Library Production Service
 University of Michigan Library
 http://www.hathitrust.org/blogs/large-scale-search
 **Legal Disclaimer***
 This communication may contain confidential and privileged
 material for the sole use of the intended recipient. Any
 unauthorized review, use or distribution by others is strictly
 prohibited. If you have received the message in error, please
 advise the sender by reply email and delete the message. Thank
 you.
 *



--
Lance Norskog
goks...@gmail.com


Re: SpellCheck Component does not work for certain words

2012-08-22 Thread mechravi25
Hi,

Just a few things to add: I found that when I search for three or fewer
letters I am not able to get any suggestions, and also that when I search
for "finding" I don't get any suggestions related to it, even though I have
search results for the same.

But when I search for "findingg" I get suggestions for it, and one of the
suggestions is "finding"; in this case the search results are zero.

Can you tell me if this is the way the spell check is intended to work, or
am I going wrong somewhere?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SpellCheck-Component-does-not-work-for-certain-words-tp4002573p4002636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: display SOLR Query in web page

2012-08-22 Thread Michael Della Bitta
Ouch, not to mention the potential for XSS.

I'll see if I can get in touch with someone.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 22, 2012 at 3:40 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Now this is very scary, while searching for solr direct access per docid I 
 got a hit
 from US Homeland Security Digital Library. Interested in what they have to 
 tell me
 about my search I clicked on the link to the page. First the page had nothing 
 unusual
 about it, but why I get the hit?
 http://www.hsdl.org/?collection/stratpolid=4

 Inspecting the page source view shows that they have the solr query displayed 
 direct
 on their page as span with style=display:none.
 -- snippet --
 <!-- Search Results -->

 <span style="display: none;">*** SOLR Query *** &mdash; q=Collection:0 AND
 (TabSection:("Congressional hearings and testimony", "Congressional
 reports", "Congressional resolutions", "Directives (presidential)",
 "Executive orders", "Major Legislation", "Public laws", "Reports (CBO)",
 "Reports (CHDS)", "Reports (CRS)",...
 ...
 AND (Title_nostem:("China Forces Senior Intelligence Officer")^10
 AlternateTitle_nostem:("China Forces Senior Intelligence
 Officer")^9)&sort=score
 desc&rows=30&start=0&indent=off&facet=on&facet.limit=1&facet.mincount=1&fl=AlternateTitle_text,Collection,CoverageCountry,CoverageState,Creator_nostem,DateLastModified,DateOfRecordEntry,Description_text,DisplayDate,DocID,ExternalDocId,ExternalDocSource,FileDate,FileExtension,FileSize,FileTitle_text,Format,Language,PublishDate,Publisher_text,Publisher_nostem,ReportNumber,ResourceType,RetrievedFrom,Rights,Subjects,Source,TabSection,Title_text,URL_text,Alternate_URL_text,CreatedBy,ModifiedBy,Notes&wt=phps&facet.field=Creator&facet.field=Format&facet.field=Language&facet.field=Publisher&facet.field=TabSection</span>
 -- snippet --

 As you can see I have searched for China Forces Senior Intelligence Officer 
 so this is directly showing the
 query string.
 Do they know that there is also a delete by query?
 And the are also escape sequences?

 This is what I call scary.
 Maybe some of the US fellows can give them a hint and a helping hand.

 Regards
 Bernd


Re: display SOLR Query in web page

2012-08-22 Thread Michael Della Bitta
Actually, I'm having a little trouble coming up with a
proof-of-concept exploit for this... it doesn't seem like Solr is
exposed directly, and it does seem like it's escaping submitted
content before redisplaying it on the page.

I'm not crazy about leaking the raw query string into the HTML, but it
doesn't seem to lead to more than just that.

Please let me know if I am missing something, it's still morningtime
here in the US and I haven't had enough coffee yet. :)

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 22, 2012 at 9:32 AM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Ouch, not to mention the potential for XSS.

 I'll see if I can get in touch with someone.

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game


 On Wed, Aug 22, 2012 at 3:40 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Now this is very scary, while searching for solr direct access per docid I 
 got a hit
 from US Homeland Security Digital Library. Interested in what they have to 
 tell me
 about my search I clicked on the link to the page. First the page had 
 nothing unusual
 about it, but why I get the hit?
 http://www.hsdl.org/?collection/stratpolid=4

 Inspecting the page source view shows that they have the solr query 
 displayed direct
 on their page as span with style=display:none.
 -- snippet --
 <!-- Search Results -->

 <span style="display: none;">*** SOLR Query *** &mdash; q=Collection:0 AND
 (TabSection:("Congressional hearings and testimony", "Congressional
 reports", "Congressional resolutions", "Directives (presidential)",
 "Executive orders", "Major Legislation", "Public laws", "Reports (CBO)",
 "Reports (CHDS)", "Reports (CRS)",...
 ...
 AND (Title_nostem:("China Forces Senior Intelligence Officer")^10
 AlternateTitle_nostem:("China Forces Senior Intelligence
 Officer")^9)&sort=score
 desc&rows=30&start=0&indent=off&facet=on&facet.limit=1&facet.mincount=1&fl=AlternateTitle_text,Collection,CoverageCountry,CoverageState,Creator_nostem,DateLastModified,DateOfRecordEntry,Description_text,DisplayDate,DocID,ExternalDocId,ExternalDocSource,FileDate,FileExtension,FileSize,FileTitle_text,Format,Language,PublishDate,Publisher_text,Publisher_nostem,ReportNumber,ResourceType,RetrievedFrom,Rights,Subjects,Source,TabSection,Title_text,URL_text,Alternate_URL_text,CreatedBy,ModifiedBy,Notes&wt=phps&facet.field=Creator&facet.field=Format&facet.field=Language&facet.field=Publisher&facet.field=TabSection</span>
 -- snippet --

 As you can see I have searched for China Forces Senior Intelligence 
 Officer so this is directly showing the
 query string.
 Do they know that there is also a delete by query?
 And the are also escape sequences?

 This is what I call scary.
 Maybe some of the US fellows can give them a hint and a helping hand.

 Regards
 Bernd


Re: Solr - case-insensitive search do not work

2012-08-22 Thread meghana
Hi Ravish, the definition of text_en_splitting in the Solr default schema
and mine are the same... still it's not working... any idea?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002645.html
Sent from the Solr - User mailing list archive at Nabble.com.


Problem to start solr-4.0.0-BETA with tomcat-6.0.20

2012-08-22 Thread Claudio Ranieri
Hi,

I tried to start solr-4.0.0-BETA with tomcat-6.0.20 but it does not work.
I copied the apache-solr-4.0.0-BETA.war to $TOMCAT_HOME/webapps. Then I copied
the directory apache-solr-4.0.0-BETA\example\solr to C:\home\solr-4.0-beta and
adjusted the file
$TOMCAT_HOME\conf\Catalina\localhost\apache-solr-4.0.0-BETA.xml to point the
solr/home to C:/home/solr-4.0-beta. With this configuration, when I start up
Tomcat I got:

SEVERE: org.apache.solr.common.SolrException: Invalid luceneMatchVersion
'LUCENE_40', valid values are: [LUCENE_20, LUCENE_21, LUCENE_22, LUCENE_23,
LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31, LUCENE_32, LUCENE_33, LUCENE_34,
LUCENE_35, LUCENE_36, LUCENE_CURRENT] or a string in format 'VV'

So I changed the line in solrconfig.xml:

<luceneMatchVersion>LUCENE_40</luceneMatchVersion>

to

<luceneMatchVersion>LUCENE_CURRENT</luceneMatchVersion>

So I got a new error:

Caused by: java.lang.ClassNotFoundException: solr.NRTCachingDirectoryFactory

This class is within the file apache-solr-core-4.0.0-BETA.jar, but for some
reason the classloader does not load the class. I then moved all jars in
$TOMCAT_HOME\webapps\apache-solr-4.0.0-BETA\WEB-INF\lib to $TOMCAT_HOME\lib.
After this setup, I got a new error:

SEVERE: java.lang.ClassCastException: 
org.apache.solr.core.NRTCachingDirectoryFactory can not be cast to 
org.apache.solr.core.DirectoryFactory

So I changed the line in solrconfig.xml:

<directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

to

<directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.NIOFSDirectoryFactory}"/>

So I got a new error:

Caused by: java.lang.ClassCastException: 
org.apache.solr.spelling.DirectSolrSpellChecker can not be cast to 
org.apache.solr.spelling.SolrSpellChecker

How can I resolve the classloader problem?
How can I resolve the cast problems with NRTCachingDirectoryFactory and
DirectSolrSpellChecker?
I cannot start up Solr 4.0 beta with Tomcat.
Thanks,
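(For reference, a context fragment of the kind described above typically
looks like the following; this is a sketch of the standard Tomcat/JNDI
approach with illustrative paths, not Claudio's exact file:

<Context docBase="C:/tomcat/webapps/apache-solr-4.0.0-BETA.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="C:/home/solr-4.0-beta" override="true"/>
</Context>
)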






Re: Solr - case-insensitive search do not work

2012-08-22 Thread Ravish Bhagdev
Did you see my message about debugging parameters?  Try that and see what's
happening behind the scenes.

I can confirm that by default the queries are NOT case sensitive.

Ravish

On Wed, Aug 22, 2012 at 2:45 PM, meghana meghana.rav...@amultek.com wrote:

 Hi Ravish , the defination for text_en_splitting in solr default schema and
 of mine are same.. still its not working... any idea?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002645.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: display SOLR Query in web page

2012-08-22 Thread Bernd Fehling
I haven't spent time trying anything; I just entered a query and noticed
that it showed up in the page source view.
If they really escape everything, it is not that dangerous, is it?

Actually, I don't want to try anything with their page;
they might not have any humor ;-)

Bernd


Am 22.08.2012 15:41, schrieb Michael Della Bitta:
 Actually, I'm having a little trouble coming up with a
 proof-of-concept exploit for this... it doesn't seem like Solr is
 exposed directly, and it does seem like it's escaping submitted
 content before redisplaying it on the page.
 
 I'm not crazy about leaking the raw query string into the HTML, but it
 doesn't seem to lead to more than just that.
 
 Please let me know if I am missing something, it's still morningtime
 here in the US and I haven't had enough coffee yet. :)
 
 Michael Della Bitta
 
 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game
 
 
 On Wed, Aug 22, 2012 at 9:32 AM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
 Ouch, not to mention the potential for XSS.

 I'll see if I can get in touch with someone.

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game


 On Wed, Aug 22, 2012 at 3:40 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Now this is very scary, while searching for solr direct access per docid 
 I got a hit
 from US Homeland Security Digital Library. Interested in what they have to 
 tell me
 about my search I clicked on the link to the page. First the page had 
 nothing unusual
 about it, but why I get the hit?
 http://www.hsdl.org/?collection/stratpolid=4

 Inspecting the page source view shows that they have the solr query 
 displayed direct
 on their page as span with style=display:none.
 -- snippet --
 <!-- Search Results -->

 <span style="display: none;">*** SOLR Query *** &mdash; q=Collection:0 AND
 (TabSection:("Congressional hearings and testimony", "Congressional
 reports", "Congressional resolutions", "Directives (presidential)",
 "Executive orders", "Major Legislation", "Public laws", "Reports (CBO)",
 "Reports (CHDS)", "Reports (CRS)",...
 ...
 AND (Title_nostem:("China Forces Senior Intelligence Officer")^10
 AlternateTitle_nostem:("China Forces Senior Intelligence
 Officer")^9)&sort=score
 desc&rows=30&start=0&indent=off&facet=on&facet.limit=1&facet.mincount=1&fl=AlternateTitle_text,Collection,CoverageCountry,CoverageState,Creator_nostem,DateLastModified,DateOfRecordEntry,Description_text,DisplayDate,DocID,ExternalDocId,ExternalDocSource,FileDate,FileExtension,FileSize,FileTitle_text,Format,Language,PublishDate,Publisher_text,Publisher_nostem,ReportNumber,ResourceType,RetrievedFrom,Rights,Subjects,Source,TabSection,Title_text,URL_text,Alternate_URL_text,CreatedBy,ModifiedBy,Notes&wt=phps&facet.field=Creator&facet.field=Format&facet.field=Language&facet.field=Publisher&facet.field=TabSection</span>
 -- snippet --

 As you can see I have searched for China Forces Senior Intelligence 
 Officer so this is directly showing the
 query string.
 Do they know that there is also a delete by query?
 And the are also escape sequences?

 This is what I call scary.
 Maybe some of the US fellows can give them a hint and a helping hand.

 Regards
 Bernd

-- 
*
Bernd Fehling                 Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)            LibTec - Bibliothekstechnologie
Universitätsstr. 25           und Wissensmanagement
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


search is slow for URL fields of type String.

2012-08-22 Thread srinalluri
This is the string fieldType:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />

These are the fields using the 'string' fieldType:

  <field name="image_url" type="string" indexed="true" stored="true"
multiValued="true" />
  <field name="url" type="string" indexed="true" stored="true"
multiValued="true" />

And this is a sample query:
/select/?q=url:http\://www.foxbusiness.com/personal-finance/2012/08/10/social-change-coming-from-gas-prices-to-rent-prices-and-beyond/
AND image_url:*

Each query like this takes around 400 milliseconds. What changes can I make
to the fieldType to improve query performance?

thanks
Srini



--
View this message in context: 
http://lucene.472066.n3.nabble.com/search-is-slow-for-URL-fields-of-type-String-tp4002662.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr memory: CATALINA_OPTS in setenv.sh ?

2012-08-22 Thread Bruno Mannina

Dear users,

I am trying to find out whether my addition to the setenv.sh file (which I
had to create because it didn't exist) has taken effect, but when I click
on the "Java Properties" link on the Solr admin web page I can't see the
CATALINA_OPTS variable.

In fact, I would like to know if the line I added to setenv.sh is OK:

CATALINA_OPTS="-server -Xss7G -Xms14G -Xmx14G $CATALINA_OPTS -XX:+UseConcMarkSweepGC -XX:NewSize=7G -XX:+UseParNewGC"

My setenv.sh file (inside /usr/share/tomcat6/bin/) contains only this line.

How can I see if the memory is well allocated?

Other question: is -XX:NewSize=7G OK?

I have 24 GB RAM (14 GB is ~60%).
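(One quick way to check what the JVM actually received, independent of the
admin page, is to inspect the running process, e.g.:

  ps -ef | grep java

The -Xms/-Xmx/-XX flags should appear on the java command line.)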


Re: display SOLR Query in web page

2012-08-22 Thread Michael Della Bitta
It's not great to leak internal implementation details of your
application like this, and it may be that someone more skilled at
exploiting things like this could find a way in.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 22, 2012 at 10:20 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 I haven't spent time in trying anything, just entered a query and recognized
 that it showed up in the page source view.
 If they really escape everything it is not that dangerous?

 Actually I don't want to try anything with their page,
 they might not have any humor ;-)

 Bernd


 Am 22.08.2012 15:41, schrieb Michael Della Bitta:
 Actually, I'm having a little trouble coming up with a
 proof-of-concept exploit for this... it doesn't seem like Solr is
 exposed directly, and it does seem like it's escaping submitted
 content before redisplaying it on the page.

 I'm not crazy about leaking the raw query string into the HTML, but it
 doesn't seem to lead to more than just that.

 Please let me know if I am missing something, it's still morningtime
 here in the US and I haven't had enough coffee yet. :)

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game


 On Wed, Aug 22, 2012 at 9:32 AM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
 Ouch, not to mention the potential for XSS.

 I'll see if I can get in touch with someone.

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game


 On Wed, Aug 22, 2012 at 3:40 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 Now this is very scary, while searching for solr direct access per docid 
 I got a hit
 from US Homeland Security Digital Library. Interested in what they have to 
 tell me
 about my search I clicked on the link to the page. First the page had 
 nothing unusual
 about it, but why I get the hit?
 http://www.hsdl.org/?collection/stratpolid=4

 Inspecting the page source view shows that they have the solr query 
 displayed direct
 on their page as span with style=display:none.
 -- snippet --
 <!-- Search Results -->

 <span style="display: none;">*** SOLR Query *** &mdash; q=Collection:0 AND
 (TabSection:("Congressional hearings and testimony", "Congressional
 reports", "Congressional resolutions", "Directives (presidential)",
 "Executive orders", "Major Legislation", "Public laws", "Reports (CBO)",
 "Reports (CHDS)", "Reports (CRS)",...
 ...
 AND (Title_nostem:("China Forces Senior Intelligence Officer")^10
 AlternateTitle_nostem:("China Forces Senior Intelligence
 Officer")^9)&sort=score
 desc&rows=30&start=0&indent=off&facet=on&facet.limit=1&facet.mincount=1&fl=AlternateTitle_text,Collection,CoverageCountry,CoverageState,Creator_nostem,DateLastModified,DateOfRecordEntry,Description_text,DisplayDate,DocID,ExternalDocId,ExternalDocSource,FileDate,FileExtension,FileSize,FileTitle_text,Format,Language,PublishDate,Publisher_text,Publisher_nostem,ReportNumber,ResourceType,RetrievedFrom,Rights,Subjects,Source,TabSection,Title_text,URL_text,Alternate_URL_text,CreatedBy,ModifiedBy,Notes&wt=phps&facet.field=Creator&facet.field=Format&facet.field=Language&facet.field=Publisher&facet.field=TabSection</span>
 -- snippet --

 As you can see I have searched for China Forces Senior Intelligence 
 Officer so this is directly showing the
 query string.
 Do they know that there is also a delete by query?
 And the are also escape sequences?

 This is what I call scary.
 Maybe some of the US fellows can give them a hint and a helping hand.

 Regards
 Bernd

 --
 *
 Bernd Fehling                 Universitätsbibliothek Bielefeld
 Dipl.-Inform. (FH)            LibTec - Bibliothekstechnologie
 Universitätsstr. 25           und Wissensmanagement
 33615 Bielefeld
 Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

 BASE - Bielefeld Academic Search Engine - www.base-search.net
 *


Re: Solr Score threshold 'reasonably', independent of results returned

2012-08-22 Thread Mou
Hi,
I think this totally depends on your requirements and is thus applicable
per user scenario. Score does not have any absolute meaning; it is always
relative to the query. If you want to watch some particular queries and
want to show only results with a score above a previously set threshold,
you can use this.

If I always had that x% threshold in place, there might be many queries
which would not return anything, and I certainly do not want that.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Score-threshold-reasonably-independent-of-results-returned-tp4002312p4002673.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Co-existing solr cloud installations

2012-08-22 Thread Buttler, David
This is really nice.  Thanks for pointing it out.
Dave

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Tuesday, August 21, 2012 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Co-existing solr cloud installations

You can use a connect string of host:port/path to 'chroot' a path. I
think currently you have to manually create the path first though. See
the ZkCli tool (doc'd on SolrCloud wiki) for a simple way to do that.

I keep meaning to look into auto making it if it doesn't exist, but
have not gotten to it.

- Mark

On Tue, Aug 21, 2012 at 4:46 PM, Buttler, David buttl...@llnl.gov wrote:
 Hi all,
 I would like to use a single zookeeper cluster to manage multiple Solr cloud 
 installations.  However, the current design of how Solr uses zookeeper seems 
 to preclude that.  Have I missed a configuration option to set a zookeeper 
 prefix for all of a Solr cloud configuration directories?

 If I look at the zookeeper data it looks like:

  * /clusterstate.json
  * /collections
  * /configs
  * /live_nodes
  * /overseer
  * /overseer_elect
  * /zookeeper
 Is there a reason not to put all of these nodes under some user-configurable 
 higher-level node, such as /solr4?
 It could have a reasonable default value to make it just as easy to find as /.

 My current issue is that I have an old Solr cloud instance from back in the 
 Solr 1.5 days, and I don't expect that the new version and the old version 
 will play nice.

 Thanks,
 Dave
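
For reference, a rough sketch of the workflow Mark describes (host names and
the /solr4 chroot path are made up for illustration; the exact zkCli.sh
invocation may vary by ZooKeeper version):

  # create the chroot node once, using ZooKeeper's own CLI:
  zkCli.sh -server zk1:2181 create /solr4 ""

  # then point every Solr node at host:port/path so all of its znodes
  # (clusterstate.json, /configs, /live_nodes, ...) live under /solr4:
  java -DzkHost=zk1:2181/solr4 -jar start.jar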



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance,

I don't understand enough of how the field collapsing is implemented, but I
thought it worked with distributed search.  Are you saying it only works if
everything that needs collapsing is on the same shard?

Tom

On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog goks...@gmail.com wrote:

 How do you separate the documents among the shards? Can you set up the
 shards such that one collapse group is only on a single shard? That
 you never have to do distributed grouping?

 On Tue, Aug 21, 2012 at 4:10 PM, Tirthankar Chatterjee
 tchatter...@commvault.com wrote:
  This wont work, see my thread on Solr3.6 Field collapsing
  Thanks,
  Tirthankar
 
  -Original Message-
  From: Tom Burton-West tburt...@umich.edu
  Date: Tue, 21 Aug 2012 18:39:25
  To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
  Reply-To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
  Cc: William Dueberdueb...@umich.edu; Phillip Farberpfar...@umich.edu
  Subject: Scalability of Solr Result Grouping/Field Collapsing:
   Millions/Billions of documents?
 
  Hello all,
 
  We are thinking about using Solr Field Collapsing on a rather large scale
  and wonder if anyone has experience with performance when doing Field
  Collapsing on millions of or billions of documents (details below. )  Are
  there performance issues with grouping large result sets?
 
  Details:
  We have a collection of the full text of 10 million books/journals.  This
  is spread across 12 shards with each shard holding about 800,000
  documents.  When a query matches a journal article, we would like to
 group
  all the matching articles from the same journal together. (there is a
  unique id field identifying the journal).  Similarly when there is a
 match
  in multiple copies of the same book we would like to group all results
 for
  the same book together (again we have a unique id field we can group on).
  Sometimes a short query against the OCR field will result in over one
  million hits.  Are there known performance issues when field collapsing
  result sets containing a million hits?
 
  We currently index the entire book as one Solr document.  We would like
 to
  investigate the feasibility of indexing each page as a Solr document
 with a
  field indicating the book id.  We could then offer our users the choice
 of
  a list of the most relevant pages, or a list of the books containing the
  most relevant pages.  We have approximately 3 billion pages.   Does
 anyone
  have experience using field collapsing on this sort of scale?
 
  Tom
 
  Tom Burton-West
  Information Retrieval Programmer
  Digital Library Production Service
  Univerity of Michigan Library
  http://www.hathitrust.org/blogs/large-scale-search



 --
 Lance Norskog
 goks...@gmail.com



Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar,

Can you give me a quick summary of what won't work and why?
I couldn't figure it out from looking at your thread.  You seem to have a
different issue, but maybe I'm missing something here.

Tom

On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

 This wont work, see my thread on Solr3.6 Field collapsing
 Thanks,
 Tirthankar




Re: Solr memory: CATALINA_OPTS in setenv.sh ?

2012-08-22 Thread Bruno Mannina

On 22/08/2012 16:57, Bruno Mannina wrote:

Dear users,

I am trying to find out whether my addition to the setenv.sh file (which I
needed to create because it didn't exist) has taken effect, but when I click
on the Java Properties link on the Solr admin web page

I can't see the variable CATALINA_OPTS.

In fact, I would like to know if the line I added to the setenv.sh file
is OK:
CATALINA_OPTS="-server -Xss7G -Xms14G -Xmx14G $CATALINA_OPTS
-XX:+UseConcMarkSweepGC -XX:NewSize=7G -XX:+UseParNewGC"


My setenv.sh file contains only this line (inside
/usr/share/tomcat6/bin/).


How can I check that the memory is actually allocated?

Another question: is -XX:NewSize=7G OK?

I have 24GB RAM (14G is ~60%)

I changed the method: I edited the tomcat6 file in /etc/init.d and
modified the JAVA_OPTS var to:

JAVA_OPTS="-server -Djava.awt.headless=true -Xms14G -Xmx14G"

Do you think that's correct given that I have 24GB RAM?
Do you think something is missing, like Xss or other options?

I found many Google pages but not really a page that explains how to
choose the right configuration.

I think there isn't a unique answer to this question.

It seems there are several methods to adjust memory for the JVM, but which is
the best?


Re: Solr Score threshold 'reasonably', independent of results returned

2012-08-22 Thread Ravish Bhagdev
Commercial solutions often report a percentage that is meant to signify the
quality of a match. Solr has a relative score, and you cannot tell just by
looking at this value whether a result is relevant enough to be on the first
page or not. The score depends on what else is in the index, so it is not easy
to normalize in the way you suggest.

Ravish

On Wed, Aug 22, 2012 at 4:03 PM, Mou mouna...@gmail.com wrote:

 Hi,
 I think this depends entirely on your requirements and is thus only applicable
 in specific user scenarios. Score does not have any absolute meaning; it is always
 relative to the query. If you want to watch some particular queries and
 want
 to show results with a score above a previously set threshold, you can use
 this.

 If I always have that x% threshold in place, there may be many queries
 which would not return anything, and I certainly do not want that.



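
One crude client-side workaround, if a percentage is still wanted despite the
caveats above, is to normalize by the maxScore of the response. A minimal
SolrJ sketch (server URL and query are made up; HttpSolrServer is the 4.x
name, CommonsHttpSolrServer on 3.x; the resulting percentage is still
relative to this one query, not an absolute measure of quality):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class ScorePercent {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
          SolrQuery q = new SolrQuery("auto");
          q.setIncludeScore(true);                  // return the score pseudo-field
          SolrDocumentList results = server.query(q).getResults();
          float maxScore = results.getMaxScore();   // best score in this result set
          for (SolrDocument doc : results) {
              float pct = 100f * (Float) doc.getFieldValue("score") / maxScore;
              System.out.println(doc.getFieldValue("id") + " -> " + pct + "%");
          }
      }
  }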



Query-side Join work in distributed Solr?

2012-08-22 Thread Timothy Potter
Just to clarify that query-side joins ( e.g. {!join from=id
to=parent_signal_id_s}id:foo ) do not work in a distributed mode yet?
I saw LUCENE-3759 as unresolved but also some some Twitter traffic
saying there was a patch available.

Cheers,
Tim


Re: Solr memory: CATALINA_OPTS in setenv.sh ?

2012-08-22 Thread Michael Della Bitta
Check your cores' status page and see if you're running the
MMapDirectory (you probably are.)

In that case, you probably want to devote even less RAM to Tomcat's
heap because the index files are being read out of memory-mapped pages
that don't reside on the heap, so you'd be devoting more memory to
caching them if you freed it up by lowering the heap.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 22, 2012 at 12:05 PM, Bruno Mannina bmann...@free.fr wrote:
 On 22/08/2012 16:57, Bruno Mannina wrote:

 Dear users,

 I am trying to find out whether my addition to the setenv.sh file (which I
 needed to create because it didn't exist) has taken effect, but when I click
 on the Java Properties link on the Solr admin web page
 I can't see the variable CATALINA_OPTS.

 In fact, I would like to know if the line I added to the setenv.sh file is
 OK:
 CATALINA_OPTS="-server -Xss7G -Xms14G -Xmx14G $CATALINA_OPTS
 -XX:+UseConcMarkSweepGC -XX:NewSize=7G -XX:+UseParNewGC"

 My setenv.sh file contains only this line (inside
 /usr/share/tomcat6/bin/).

 How can I check that the memory is actually allocated?

 Another question: is -XX:NewSize=7G OK?

 I have 24GB RAM (14G is ~60%)

 I changed the method: I edited the tomcat6 file in /etc/init.d and
 modified the JAVA_OPTS var to:
 JAVA_OPTS="-server -Djava.awt.headless=true -Xms14G -Xmx14G"

 Do you think that's correct given that I have 24GB RAM?
 Do you think something is missing, like Xss or other options?

 I found many Google pages but not really a page that explains how to choose
 the right configuration.
 I think there isn't a unique answer to this question.

 It seems there are several methods to adjust memory for the JVM, but which is
 the best?
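
Putting Michael's advice together, a sketch of what a more conservative
setenv.sh might look like on a 24GB box (sizes are illustrative, not a
recommendation; note that -Xss sets the *per-thread* stack size, so a value
like 7G there is almost certainly unintended):

  # /usr/share/tomcat6/bin/setenv.sh
  # keep the heap modest so the OS page cache can hold the MMapDirectory index files
  CATALINA_OPTS="-server -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC $CATALINA_OPTS"
  export CATALINA_OPTS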


Re: Solr 3.6.1: query performance is slow when asterisk is in the query

2012-08-22 Thread david3s
Hello Chris, thanks a lot for your reply. But is there an alternative
solution? Because I see adding has_body as data duplication.

Imagine that in a relational DB you had to create extra columns because
you can't do something like "where body is not null".

If there's no other alternative I'll have to go with your suggestion, which I
greatly appreciate.





Re: Solr 3.6.1: query performance is slow when asterisk is in the query

2012-08-22 Thread Michael Della Bitta
The name of the game for performance and functionality in Solr is quite
often *denormalization*, which might run against your RDBMS instincts,
but once you embrace it, you'll find that things go a lot more
smoothly.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 22, 2012 at 12:37 PM, david3s davi...@hotmail.com wrote:
 Hello Chris, thanks a lot for your reply. But is there an alternative
 solution? Because I see adding has_body as data duplication.

 Imagine that in a relational DB you had to create extra columns because
 you can't do something like where body is not null

 If there's no other alternative I'll have to go with your suggestion that I
 greatly appreciate.





Index version generation for Solr 3.5

2012-08-22 Thread Xin Li
Hi,

I ran into an issue lately with the index version & generation in Solr 3.5.

In Solr 1.4, the index version of the slave server increments upon each
replication. However, I noticed that this is not the case for Solr 3.5; the
index version would increase by 20 or 30 after replication. Does anyone
know why, and is there any reference on the web for this?
The index generation does still increment after replication, though.

Thanks,

Xin


Re: Solr 3.6.1: query performance is slow when asterisk is in the query

2012-08-22 Thread Jack Krupansky
You could also add a bodySize numeric (trie) field, which you can check for 
0 for empty/missing bodies.


And don't forget to check and see whether the [* TO *] range query might 
be faster.


-- Jack Krupansky

-Original Message- 
From: david3s

Sent: Wednesday, August 22, 2012 12:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.6.1: query performance is slow when asterisk is in the 
query


Hello Chris, thanks a lot for your reply. But is there an alternative
solution? Because I see adding has_body as data duplication.

Imagine that in a relational DB you had to create extra columns because
you can't do something like where body is not null

If there's no other alternative I'll have to go with your suggestion that I
greatly appreciate.



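
A sketch of Jack's bodySize idea (field and type names are illustrative; tint
here is assumed to be the trie-int type from the example schema):

  <!-- schema.xml: populate this at index time with the length of the body -->
  <field name="bodySize" type="tint" indexed="true" stored="false"/>

and then filter on it instead of body:* :

  fq=bodySize:[1 TO *]      (or, equivalently, fq=-bodySize:0)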



Re: Edismax parser weird behavior

2012-08-22 Thread Jack Krupansky
Don't have an immediate answer for you on #1, but for #2, mm does not 
override explicit operators - "and" - it only applies to terms that are not 
the immediate operand of an explicit operator. Note that by default 
lower-case operators are enabled in edismax - "and" is treated as "AND" - 
you can set lowercaseOperators=false to avoid that.


-- Jack Krupansky

-Original Message- 
From: amitesh116

Sent: Wednesday, August 22, 2012 8:13 AM
To: solr-user@lucene.apache.org
Subject: Edismax parser weird behavior

Hi I am experiencing 2 strange behavior in edismax:
edismax is configured to behave default OR (using mm=0)
Total there are 700 results
1. Search for *auto* = *50 results*
  Search for *NOT auto* it gives *651 results*.
Mathematically, it should give only 650 results for *NOT auto*.

2. Search for *auto*  = 50 results
Search for *car =  100 results*
Search for *auto and car = 10 results*
Since we have set mm=0, it should behave like OR and results for auto and
car would be more than 100 at least

Please help me, understand these two issues. Are these normal behavior? Do I
need to tweak the query? Or do I need to look into config or scheam xml
files.

Thanks in Advance



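
To make the mm/operator interaction concrete, a few illustrative requests
(field and parameter values are made up; URL-encoding omitted for
readability):

  q=auto car&defType=edismax&mm=0                  -> roughly auto OR car
  q=auto and car&defType=edismax&mm=0              -> "and" is treated as AND
  q=auto and car&defType=edismax&mm=0&lowercaseOperators=false
                                                   -> "and" becomes a plain term again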



Re: Solr 3.6.1: query performance is slow when asterisk is in the query

2012-08-22 Thread david3s
Jack, sorry I forgot to answer you: we tried [* TO *] and the response
times are the same as doing a plain *





Re: Which directories are required in Solr?

2012-08-22 Thread Erick Erickson
Why do you care? I suspect that the example directory can be removed,
assuming you're distributing the war file. But disk space is really cheap;
I suspect that tidying up the directories for aesthetic reasons isn't worth
the risk of removing something that you might need later...

Best
Erick

On Wed, Aug 22, 2012 at 3:32 AM, Alexander Cougarman acoug...@bwc.org wrote:
 Hi. Which folders/files can be deleted from the default Solr package 
 (apache-solr-3.6.1.zip) on Windows if all we'd like to do is index/store 
 documents? Thanks.

 Sincerely,
 Alex



Re: Which directories are required in Solr?

2012-08-22 Thread Geek Gamer
Hi,

checkout : https://github.com/geek4377/jetty-solr

you can remove exampledocs from the list to get only the required dirs for
running solr.


On Wed, Aug 22, 2012 at 1:02 PM, Alexander Cougarman acoug...@bwc.orgwrote:

 Hi. Which folders/files can be deleted from the default Solr package
 (apache-solr-3.6.1.zip) on Windows if all we'd like to do is index/store
 documents? Thanks.

 Sincerely,
 Alex




Re: Solr Custom Filter Factory - How to pass parameters?

2012-08-22 Thread Erick Erickson
I'm reaching a bit here, haven't implemented one myself, but...

It seems like you're just dealing with some shared memory. So say
your filter recorded all the stuff you want to put into the DB. When
you put stuff into the shared memory, you probably have to figure
out when you should commit the batch (if you're indexing 100M docs,
you probably don't want to use up that much memory, but what do I know).
This is all done at the filter.

It seems like you could also create a SolrEventListener on
the postCommit event
(see: http://wiki.apache.org/solr/SolrPlugins#SolrEventListener)
to put whatever remained in your list into your DB.

Of course you'd have to do some synchronization so multiple threads
played nice with each other. And you'd have to be sure to fire a commit
at the end of your indexing process if you wanted some certainty that
everything was tidied up. If some delay isn't a problem and you have
autocommit configured, then your event listener would be called when
the next autocommit happened.

FWIW
Erick

On Tue, Aug 21, 2012 at 8:19 PM, ksu wildcats ksu.wildc...@gmail.com wrote:
 Jack

 Reading through the documentation for UpdateRequestProcessor my
 understanding is that its good for handling processing of documents before
 analysis.
 Is it true that processAdd (where we can have custom logic) is invoked once
 per document and is invoked before any of the analyzers gets invoked?

 I couldn't figure out how I can use UpdateRequestProcessor to access the
 tokens stored in memory by CustomFilterFactory/CustomFilter.

 Can you please provide more information on how I can use
 UpdateRequestProcessor to handle any post processing that needs to be done
 after all documents are added to the index?

 Also does CustomFilterFactory/CustomFilter has any ways to do post
 processing after all documents are added to index?

 Here is the code i have for CustomFilterFactory/CustomFilter. This might
 help understand what i am trying to do and may be there is a better way to
 do this.
 The main problem i have with this approach is that i am forced to write
 results stored in memory (customMap) to database per document and if i have
 1 million documents then thats 1 million db calls. I am trying to avoid the
 number of calls made to database by storing results in memory and write
 results to database once for every X documents (say, every 1 docs).

 public class CustomFilterFactory extends BaseTokenFilterFactory {
   public CustomFilter create(TokenStream input) {
     // "paramname" is the factory argument configured in schema.xml
     String databaseName = getArgs().get("paramname");
     return new CustomFilter(input, databaseName);
   }
 }

 public class CustomFilter extends TokenFilter {
   private TermAttribute termAtt;
   Map<TermAttribute, Integer> customMap = new HashMap<TermAttribute, Integer>();
   String databasename = null;
   int commitSize = 10000; // flush to the DB once this many entries accumulate

   protected CustomFilter(TokenStream input, String databasename) {
     super(input);
     termAtt = (TermAttribute) addAttribute(TermAttribute.class);
     this.databasename = databasename;
   }

   public final boolean incrementToken() throws IOException {
     if (!input.incrementToken()) {
       writeResultsToDB();
       return false;
     }

     if (addWordToCustomMap()) {
       // do some analysis on the term and then populate customMap
       // customMap.put(term, somevalue);
     }

     if (customMap.size() > commitSize) {
       writeResultsToDB();
     }
     return true;
   }

   boolean addWordToCustomMap() {
     // custom logic - some validation on the term to determine whether it
     // should be added to customMap
     return true;
   }

   void writeResultsToDB() throws IOException {
     // custom logic that reads data from customMap, does some analysis, and
     // writes it to the database, then clears the map
   }
 }





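
A rough sketch of the postCommit listener Erick suggests (class and package
names are hypothetical; the 3.x-era SolrEventListener interface is assumed,
and 4.x adds a postSoftCommit() method):

  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.core.SolrEventListener;
  import org.apache.solr.search.SolrIndexSearcher;

  public class FlushTokensToDbListener implements SolrEventListener {
      public void init(NamedList args) {
          // e.g. read a "databasename" parameter from the listener config
      }

      public void postCommit() {
          // drain whatever the filters left in the shared in-memory map and
          // write it to the database here (synchronized with the filters)
      }

      public void newSearcher(SolrIndexSearcher newSearcher,
                              SolrIndexSearcher currentSearcher) {
          // not needed for this use case
      }
  }

registered in solrconfig.xml with something like:

  <listener event="postCommit" class="com.example.FlushTokensToDbListener"/>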


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance and Tirthankar,

We are currently using Solr 3.6.  I tried a search across our current 12
shards grouping by book id (record_no in our schema) and it seems to work
fine (the query with the actual urls for the shards changed is appended
below.)

I then searched for the record_no of the second group in the results to
confirm that the number of records being folded is correct. In both cases
the numFound is 505 so it seems as though the record counts for the group
are correct.  Then I tried the same search but changed the shards parameter
to limit the search to 1/2 of the shards and got numFound = 325.  This
shows that the items in the group are distributed between different shards.

What am I missing here?   What is it that you are saying does not work?

Tom
Field collapse query (IP address changed, newlines added, and shard
URLs simplified for readability):


http://solr-myhost.edu/serve-9/select?indent=on&version=2.2
&shards=shard1,shard2,shard3,shard4,shard5,shard6,...shard12
&q=title:nature&fq=&start=0&rows=10&fl=id,author,title,volume_enumcron,score
&group=true&group.field=record_no&group.limit=2


Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Hello,

Usually in the example/solr file in Solr distributions there is a populated
conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
there is no /conf directory.   Has this been moved somewhere?

Tom

ls -l apache-solr-4.0.0-BETA/example/solr
total 107
drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
-rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
-rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
-rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg


RE: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Markus Jelsma
Hi - The example has been moved to collection1/

 
 
-Original message-
 From:Tom Burton-West tburt...@umich.edu
 Sent: Wed 22-Aug-2012 20:59
 To: solr-user@lucene.apache.org
 Subject: Solr 4.0 Beta missing example/conf files?
 
 Hello,
 
 Usually in the example/solr file in Solr distributions there is a populated
 conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
 there is no /conf directory.   Has this been moved somewhere?
 
 Tom
 
 ls -l apache-solr-4.0.0-BETA/example/solr
 total 107
 drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
 drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
 -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
 -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
 -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
 


Cloud assigning incorrect port to shards

2012-08-22 Thread Buttler, David
Hi,
I have set up a Solr 4 beta cloud cluster.  I have uploaded a config directory, 
and linked it with a configuration name.

I have started two Solr instances on two computers and added a couple of shards using the 
Core Admin function on the admin page.

When I go to the admin cloud view, the shards all have the computer name and 
port attached to them.  BUT, the port is the default port (8983), and not the 
port that I assigned on the command line.  I can still connect to the correct 
port, and not the reported port.  I anticipate that this will lead to errors 
when I get to doing distributed query, as zookeeper seems to be collecting 
incorrect information.

Any thoughts as to why the incorrect port is being stored in zookeeper?

Thanks,
Dave
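
One thing worth checking (a guess, since the container isn't known yet): in
the 4.0-beta example solr.xml the port that gets published to ZooKeeper comes
from the hostPort attribute, which resolves the jetty.port system property
rather than whatever flag the container itself was started with (the exact
attribute syntax may differ in your solr.xml):

  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}">

  # so the port must be passed as -Djetty.port for ZooKeeper to see it:
  java -Djetty.port=7574 -DzkHost=zk1:2181 -jar start.jar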


Full Text Indexing for DOCX files

2012-08-22 Thread Nguyen, Vincent (CDC/OD/OADS) (CTR)
Has anyone been able to index DOCX files?  I get this error message when using 
office 2007 documents

(Location of error 
unknown)org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied 
data appears to be in the Office 2007+ XML. POI only supports OLE2 Office 
documents

We're currently using SOLR1.3

Vincent Vu Nguyen




Re: Does DIH commit during large import?

2012-08-22 Thread Erick Erickson
solrconfig.xml has a setting ramBufferSizeMB that can be set
to limit the memory consumed during indexing. When this limit
is reached, the buffers are flushed to the current segment. NOTE:
the segment is NOT closed, there is no implied commit here, and
the data will not be searchable until a commit happens.

Best
Erick

On Wed, Aug 22, 2012 at 7:10 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 Thanks, I will look into autoCommit.

 I assume there are memory implications of not committing? Or is it
 just writing in a separate file and can theoretically do it
 indefinitely?

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Wed, Aug 22, 2012 at 2:42 AM, Lance Norskog goks...@gmail.com wrote:
 Solr has a separate feature called 'autoCommit'. This is configured in
 solrconfig.xml. You can set Solr to commit all documents every N
 milliseconds or every N documents, whichever comes first. If you want
 intermediate commits during a long DIH session, you have to use this
 or make your own script that does commits.

 On Tue, Aug 21, 2012 at 8:48 AM, Shawn Heisey s...@elyograg.org wrote:
 On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:

 I am doing an import of large records (with large full-text fields)
 and somewhere around 30 records DataImportHandler runs out of
 memory (Heap) on a TIKA import (triggered from custom Processor) and
 does roll-back. I am using store=false and trying some tricks and
 tracking possible memory leaks, but also have a question about DIH
 itself.

 What actually happens when I run DIH on a large (XML Source) job? Does
 it accumulate some sort of status in memory that it commits at the
 end? If so, can I do intermediate commits to drop the memory
 requirements? Or, will it help to do several passes over the same
 dataset and import only particular entries at a time? I am using the
 Solr 4 (alpha) UI, so I can see some of the options there.


 I use Solr 3.5 and a MySQL database for import, so my setup may not be
 completely relevant, but here is my experience.

 Unless you turn on autocommit in solrconfig, documents will not be
 searchable during the import.  If you have commit=true for DIH (which I
 believe is the default), there will be a commit at the end of the import.

 It looks like there's an out of memory issue filed on Solr 4 DIH with Tika
 that is suspected to be a bug in Tika rather than Solr.  The issue details
 talk about some workarounds for those who are familiar with Tika -- I'm not.
 The issue URL:

 https://issues.apache.org/jira/browse/SOLR-2886

 Thanks,
 Shawn




 --
 Lance Norskog
 goks...@gmail.com
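
For reference, a sketch of the two knobs discussed in this thread (values are
illustrative; in 3.x ramBufferSizeMB sits under indexDefaults/mainIndex
rather than indexConfig):

  <!-- solrconfig.xml -->
  <indexConfig>
    <!-- flush (not commit) the in-memory buffer when it reaches this size -->
    <ramBufferSizeMB>100</ramBufferSizeMB>
  </indexConfig>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>   <!-- commit every N docs... -->
      <maxTime>60000</maxTime>   <!-- ...or every N ms, whichever comes first -->
    </autoCommit>
  </updateHandler>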


Re: Full Text Indexing for DOCX files

2012-08-22 Thread Jack Krupansky

I've indexed Office 2007 .docx using Solr 3.6.

It sounds as if Solr 1.3 had an old release of Tika/POI. No big surprise 
there.


-- Jack Krupansky

-Original Message- 
From: Nguyen, Vincent (CDC/OD/OADS) (CTR)

Sent: Wednesday, August 22, 2012 3:57 PM
To: solr-user@lucene.apache.org
Subject: Full Text Indexing for DOCX files

Has anyone been able to index DOCX files?  I get this error message when 
using office 2007 documents


(Location of error 
unknown)org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied 
data appears to be in the Office 2007+ XML. POI only supports OLE2 Office 
documents


We're currently using SOLR1.3

Vincent Vu Nguyen
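
For reference, on a recent Solr the usual route for .docx is the
ExtractingRequestHandler (Solr Cell); a minimal sketch, with the URL, id and
file name made up:

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@report.docx"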




RE: Full Text Indexing for DOCX files

2012-08-22 Thread Nguyen, Vincent (CDC/OD/OADS) (CTR)
Thanks Jack, I'll give that version of SOLR a try.

Vincent Vu Nguyen
Web Applications Developer
Division of Science Quality and Translation
Office of the Associate Director for Science
Centers for Disease Control and Prevention (CDC)
404-498-0384 v...@cdc.gov
Century Bldg 2400
Atlanta, GA 30329 


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Wednesday, August 22, 2012 4:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Full Text Indexing for DOCX files

I've indexed Office 2007 .docx using Solr 3.6.

It sounds as if Solr 1.3 had an old release of Tika/POI. No big surprise there.

-- Jack Krupansky

-Original Message-
From: Nguyen, Vincent (CDC/OD/OADS) (CTR)
Sent: Wednesday, August 22, 2012 3:57 PM
To: solr-user@lucene.apache.org
Subject: Full Text Indexing for DOCX files

Has anyone been able to index DOCX files?  I get this error message when using 
office 2007 documents

(Location of error
unknown)org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied 
data appears to be in the Office 2007+ XML. POI only supports OLE2 Office 
documents

We're currently using SOLR1.3

Vincent Vu Nguyen




Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Thanks Markus!

Should the README.txt file in solr/example be updated to reflect this?
Is that something I need to enter a JIRA issue for?

Tom

On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Hi - The example has been moved to collection1/



 -Original message-
  From:Tom Burton-West tburt...@umich.edu
  Sent: Wed 22-Aug-2012 20:59
  To: solr-user@lucene.apache.org
  Subject: Solr 4.0 Beta missing example/conf files?
 
  Hello,
 
  Usually in the example/solr file in Solr distributions there is a
 populated
  conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
  there is no /conf directory.   Has this been moved somewhere?
 
  Tom
 
  ls -l apache-solr-4.0.0-BETA/example/solr
  total 107
  drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
  drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
  -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
  -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
  -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
 



Re: Solr 3.6.1: query performance is slow when asterisk is in the query

2012-08-22 Thread david3s
OK, I'll take your suggestion, but I would still be really happy if
wildcard searches behaved a little more intelligently (body:* not looking for
everything in the body) - more like q=*:*, which doesn't really
search for everything in every field.

Thanks





RE: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Markus Jelsma
Hi,

I would think so. Perhaps something for:
https://issues.apache.org/jira/browse/SOLR-3288 
 

-Original message-
 From:Tom Burton-West tburt...@umich.edu
 Sent: Wed 22-Aug-2012 22:35
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 4.0 Beta missing example/conf files?
 
 Thanks Markus!
 
 Should the README.txt file in solr/example be updated to reflect this?
 Is that something I need to enter a JIRA issue for?
 
 Tom
 
 On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Hi - The example has been moved to collection1/
 
 
 
  -Original message-
   From:Tom Burton-West tburt...@umich.edu
   Sent: Wed 22-Aug-2012 20:59
   To: solr-user@lucene.apache.org
   Subject: Solr 4.0 Beta missing example/conf files?
  
   Hello,
  
   Usually in the example/solr file in Solr distributions there is a
  populated
   conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
   there is no /conf directory.   Has this been moved somewhere?
  
   Tom
  
   ls -l apache-solr-4.0.0-BETA/example/solr
   total 107
   drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
   drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
   -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
   -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
   -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
  
 
 


Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Thanks Tirthankar,

So the issue is memory use for sorting.  I'm not sure I understand how
sorting of grouping fields is involved with the defaults and field
collapsing, since the default sorts by relevance, not by grouping field.  On
the other hand, I don't know much about how field collapsing is implemented.

So far the few tests I've made haven't revealed any memory problems.  We
are using very small string fields for grouping and I think that we
probably only have a couple of cases where we are grouping more than a few
thousand docs.   I will try to find a query with a lot of docs per group
and take a look at the memory use using JConsole.

Tom


On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

  Hi Tom,

 We had an issue where we are keeping millions of docs in a single node and
 we were trying to group them on a string field which is nothing but full
 file path… that caused SOLR to go out of memory…

 Erick has explained nicely in the thread as to why it won’t work and I had
 to find another way of architecting it. 

 How do you think this is different in your case? If you want to group by a
 string field with thousands of similar entries I am guessing you will face
 the same issue. 

 Thanks,

 Tirthankar



Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Mark Miller
Yeah - we want to fix that for sure. 

Sent from my iPhone

On Aug 22, 2012, at 6:34 PM, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi,
 
 I would think so. Perhaps something for:
 https://issues.apache.org/jira/browse/SOLR-3288 
 
 
 -Original message-
 From:Tom Burton-West tburt...@umich.edu
 Sent: Wed 22-Aug-2012 22:35
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 4.0 Beta missing example/conf files?
 
 Thanks Markus!
 
 Should the README.txt file in solr/example be updated to reflect this?
 Is that something I need to enter a JIRA issue for?
 
 Tom
 
 On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
 Hi - The example has been moved to collection1/
 
 
 
 -Original message-
 From:Tom Burton-West tburt...@umich.edu
 Sent: Wed 22-Aug-2012 20:59
 To: solr-user@lucene.apache.org
 Subject: Solr 4.0 Beta missing example/conf files?
 
 Hello,
 
 Usually in the example/solr file in Solr distributions there is a
 populated
 conf file.  However in the distribution I downloaded of solr 4.0.0-BETA,
 there is no /conf directory.   Has this been moved somewhere?
 
 Tom
 
 ls -l apache-solr-4.0.0-BETA/example/solr
 total 107
 drwxr-sr-x 2 tburtonw dlps0 May 29 13:02 bin
 drwxr-sr-x 3 tburtonw dlps   22 Jun 28 09:21 collection1
 -rw-r--r-- 1 tburtonw dlps 2259 May 29 13:02 README.txt
 -rw-r--r-- 1 tburtonw dlps 2171 Jul 31 19:35 solr.xml
 -rw-r--r-- 1 tburtonw dlps  501 May 29 13:02 zoo.cfg
 
 
 


Re: Cloud assigning incorrect port to shards

2012-08-22 Thread Mark Miller
What container are you using?

Sent from my iPhone

On Aug 22, 2012, at 3:14 PM, Buttler, David buttl...@llnl.gov wrote:

 Hi,
 I have set up a Solr 4 beta cloud cluster.  I have uploaded a config 
 directory, and linked it with a configuration name.
 
 I have started two solr on two computers and added a couple of shards using 
 the Core Admin function on the admin page.
 
 When I go to the admin cloud view, the shards all have the computer name and 
 port attached to them.  BUT, the port is the default port (8983), and not the 
 port that I assigned on the command line.  I can still connect to the correct 
 port, and not the reported port.  I anticipate that this will lead to errors 
 when I get to doing distributed query, as zookeeper seems to be collecting 
 incorrect information.
 
 Any thoughts as to why the incorrect port is being stored in zookeeper?
 
 Thanks,
 Dave


Re: Solr Custom Filter Factory - How to pass parameters?

2012-08-22 Thread ksu wildcats
Thanks Erick.
I tried to do it all at the filter, but the problem I am running into doing
it at the filter is intercepting the final commit call; in other words, I
am unable to figure out when the final commit will happen, so that I
don't miss any data.
One option I tried is to increase the in-memory batch size and commit the
data from memory to the database in the incrementToken method, but this can
lead to losing the data still in memory if the size of the final batch is
less than the set threshold.

I'll try using a SolrEventListener and see if that can help resolve the issues
I am running into.





Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Lance Norskog
Yes, distributed grouping works, but grouping takes a lot of
resources. If you can avoid it in distributed mode, so much the better.

On Wed, Aug 22, 2012 at 3:35 PM, Tom Burton-West tburt...@umich.edu wrote:
 Thanks Tirthankar,

 So the issue is memory use for sorting.  I'm not sure I understand how
 sorting of grouping fields is involved with the defaults and field
 collapsing, since the default sorts by relevance, not by grouping field.  On
 the other hand, I don't know much about how field collapsing is implemented.

 So far the few tests I've made haven't revealed any memory problems.  We
 are using very small string fields for grouping and I think that we
 probably only have a couple of cases where we are grouping more than a few
 thousand docs.   I will try to find a query with a lot of docs per group
 and take a look at the memory use using JConsole.

 Tom


 On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee 
 tchatter...@commvault.com wrote:

  Hi Tom,

 We had an issue where we are keeping millions of docs in a single node and
 we were trying to group them on a string field which is nothing but full
 file path… that caused SOLR to go out of memory…

 Erick has explained nicely in the thread as to why it won’t work and I had
 to find another way of architecting it. 

 How do you think this is different in your case? If you want to group by a
 string field with thousands of similar entries I am guessing you will face
 the same issue. 

 Thanks,

 Tirthankar




-- 
Lance Norskog
goks...@gmail.com


Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

2012-08-22 Thread ksu wildcats
Thanks for the reply Mikhail.

For our needs, speed is more important than flexibility, and we have huge
text files (e.g. blogs / articles of ~2 MB) that need to be read from our
filesystem and then stored in the index.

We have our app creating a separate core per client (dynamically), and there is
one instance of EmbeddedSolrServer for each core that's used for adding
documents to the index.
Each document has about 10 fields, and one of the fields has ~2MB of data stored
(stored=true, analyzed=true).
Also, we have logic built into our webapp to dynamically create the solr
config files
(solrConfig & schema per core - filters/analyzers/handler values can be
different for each core)
for each core before creating an instance of EmbeddedSolrServer for that
core.
Another reason to go with EmbeddedSolrServer is to reduce the overhead of
transporting large data (~2 MB) over http/xml.

We use this setup for building our master index, which then gets replicated
to slave servers
using the replication scripts provided by solr.
We also have the solr admin UI integrated into our webapp (using the admin jsp &
handlers from the solr admin UI).

We have been using this MultiCore setup for more than a year now, and so far
we haven't run into any issues with EmbeddedSolrServer integrated into our
webapp.
However, I am now trying to figure out the impact if we allow multiple
threads to send requests to EmbeddedSolrServer (same core) for adding docs to
the index simultaneously.

Our understanding was that EmbeddedSolrServer would give us better
performance than HTTP Solr for our needs.
It's quite possible that we might be wrong and HTTP Solr would have given us
similar or better performance.

Also, based on documentation from the SolrWiki, I am assuming that the
EmbeddedSolrServer API is the same as the one used by HTTP Solr.

That said, can you please tell me if there is any specific downside to using
EmbeddedSolrServer that could cause issues for us down the line?

I am also interested in your comment below about indexing 1 million docs in
a few minutes. Ideally we would like to get to that speed.
I am assuming this depends on the size of the doc and the type of
analyzer/tokenizer/filters being used. Correct?
Can you please share (or point me to documentation on) how to get this speed
for 1 million docs?
  - one million is a fairly small amount, in average it should be indexed
 in few mins. I doubt that you really need to distribute indexing

Thanks
-K



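
Not an authoritative answer on the downsides, but for comparison, a minimal
sketch of the per-core EmbeddedSolrServer setup described above (3.x-era
SolrJ/core API; paths and core names are made up):

  import java.io.File;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.core.CoreContainer;

  public class EmbeddedIndexer {
      public static void main(String[] args) throws Exception {
          // solr home whose solr.xml lists one core per client
          File home = new File("/data/solr");
          CoreContainer container = new CoreContainer(home.getAbsolutePath(),
                  new File(home, "solr.xml"));
          SolrServer server = new EmbeddedSolrServer(container, "client1");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          doc.addField("body", "...large text...");
          server.add(doc);     // concurrent add() calls from several threads should be fine
          server.commit();
          container.shutdown();
      }
  }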


Re: Weighted Search Results / Multi-Value Value's Not Aggregating Weight

2012-08-22 Thread David Radunz

Hey,

Please disregard this, I worked out what the actual problem was. I 
am going to post another query with something else I discovered.


Thanks :)

David

On 22/08/2012 7:24 PM, David Radunz wrote:

Hey,

I have been having some problems getting good search results when
using weighting against many fields with multi-values. After quite a
bit of testing, it seems to me that the problem (at least as far as
my query is concerned) is that only one weighting is taken into
account per field. For example, in a multi-value field, if we have
Comedy and Romance and have separate weightings for those, the
one with the highest weighting is used (and not a combined weighting).
Which means that searching for romantic comedy returns Alvin and the
Chipmunks (Family, Children Comedy).


Query:

facet=on&fl=id,name,matching_genres,score,url_path,url_key,price,special_price,small_image,thumbnail,sku,stock_qty,release_date&sort=score+desc,retail_rating+desc,release_date+desc&start=&q=**+-sku:1019660+-movie_id:1805+-movie_id:1806+(series_names_attr_opt_id:454282^9000+OR+cat_id:22^9+OR+cat_id:248^9+OR+cat_id:249^9+OR+matching_genres:Comedy^9+OR+matching_genres:Romance^7+OR+matching_genres:Drama^5)&fq=store_id:1+AND+avail_status_attr_opt_id:available+AND+(format_attr_opt_id:372619)&rows=4 



Now if I change matching_genres:Romance^7 to 
matching_genres:Romance^70 (adding a 0) suddenly the first 
result is Sex and the City: The Movie / Sex and the City 2 (which 
ironically is Drama, Comedy, Romance - The very combination we 
are looking for).


So is there a way to structure my query so that all of the 
multi-value values are treated individually? Aggregating the 
weighting/score?


Thanks in advance!

David