Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread onetwothree
Thanks for the reply, dropbox image added.





Optimizing index on Slave

2014-01-20 Thread Salman Akram
All,

I know the index should normally be optimized on the master and then
replicated to the slaves, but we have an issue with network bandwidth.

We optimize our indexes weekly (total size is around 1.5TB). We have a few
slaves set up on the local network, so replicating the whole index to them
is not a big issue.

However, we also have one slave in another city (on a backup network) which
of course gets replicated over the internet, which is quite slow and
expensive. We want to avoid copying the complete index every week after
optimization, and were wondering if it's possible to optimize independently
on that slave so that there is no delta between master and slave. We tried
to do it, but the slave still replicated from the master.
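
For reference, a minimal SolrJ sketch of that sequence, assuming the stock
ReplicationHandler (host and core name are placeholders; polling has to be
disabled first, otherwise the next poll will fetch from the master
regardless):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class OptimizeSlaveLocally {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at the remote slave's core.
        SolrServer slave = new HttpSolrServer("http://remote-slave:8983/solr/core1");

        // Stop the slave from polling the master while we work on it.
        ModifiableSolrParams disableParams = new ModifiableSolrParams();
        disableParams.set("command", "disablepoll");
        QueryRequest disable = new QueryRequest(disableParams);
        disable.setPath("/replication");
        disable.process(slave);

        // Optimize the slave's local copy of the index.
        slave.optimize();

        // Re-enable polling afterwards. Note: the next poll may still trigger
        // a full fetch if the master's index generation differs.
        ModifiableSolrParams enableParams = new ModifiableSolrParams();
        enableParams.set("command", "enablepoll");
        QueryRequest enable = new QueryRequest(enableParams);
        enable.setPath("/replication");
        enable.process(slave);
    }
}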


-- 
Regards,

Salman Akram


Re: Multi Lingual Analyzer

2014-01-20 Thread Benson Margulies
MT is not nearly good enough to allow approach 1 to work.

On Mon, Jan 20, 2014 at 9:25 AM, Erick Erickson  wrote:
> It Depends (tm). Approach (2) will give you better, more specific
> search results. (1) is simpler to implement and might be "good
> enough"...
>
>
>
> On Mon, Jan 20, 2014 at 5:21 AM, David Philip
>  wrote:
>> Hi,
>>
>>
>>
>> I have a query on multi-lingual analyzers.
>>
>> Which of the two approaches below is the best?
>>
>> 1. To develop a translator that translates any language to English, and
>> then use the standard English analyzer, applying the translator both at
>> index time and at search time?
>>
>> 2. To develop a language-specific analyzer and use it on a field created
>> specifically for that language?
>>
>> We have client data coming in different languages: Kannada and Telugu,
>> with others to come later. This data is basically text written by
>> customers in those languages.
>>
>> The requirement is to develop analyzers specific to these languages.
>>
>> Thanks - David


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
4.6.0


On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller  wrote:

> What version are you running?
>
> - Mark
>
> On Jan 20, 2014, at 5:43 PM, Software Dev 
> wrote:
>
> > We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
> > updates get sent to one machine or something?
> >
> >
> > On Mon, Jan 20, 2014 at 2:42 PM, Software Dev wrote:
> >
> >> We have a soft commit every 5 seconds and a hard commit every 30. As far
> >> as docs/second, I would guess around 200/sec, which doesn't seem that
> >> high.
> >>
> >>
> >> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Questions: How often do you commit your updates? What is your
> >>> indexing rate in docs/second?
> >>>
> >>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
> >>> server is having trouble keeping up with updates, switching to CUSS
> >>> probably wouldn't help.
> >>>
> >>> So I suspect there's something not optimal about your setup that's
> >>> the culprit.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev <static.void@gmail.com> wrote:
> >>>> We are testing our shiny new Solr Cloud architecture, but we are
> >>>> experiencing some issues when doing bulk indexing.
> >>>>
> >>>> We have 5 Solr Cloud machines running and 3 indexing machines (separate
> >>>> from the cloud servers). The indexing machines pull ids off a queue,
> >>>> then index and ship the documents over via a CloudSolrServer. It
> >>>> appears that the indexers are too fast, because the load (particularly
> >>>> disk IO) on the Solr Cloud machines spikes through the roof, making
> >>>> the entire cluster unusable. It's kind of odd because the total index
> >>>> size is not even large, i.e. < 10GB. Are there any optimizations or
> >>>> enhancements I could try to help alleviate these problems?
> >>>>
> >>>> I should note that for the above collection we only have 1 shard
> >>>> that's replicated across all machines, so all machines have the full
> >>>> index.
> >>>>
> >>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where
> >>>> all updates get sent to 1 machine and 1 machine only? We could then
> >>>> remove this machine from the cluster that handles user requests.
> >>>>
> >>>> Thanks for any input.
> >>>
> >>
> >>
>
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Mark Miller
What version are you running?

- Mark

On Jan 20, 2014, at 5:43 PM, Software Dev  wrote:

> We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
> updates get sent to one machine or something?
> 
> 
> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev 
> wrote:
> 
>> We have a soft commit every 5 seconds and a hard commit every 30. As far
>> as docs/second, I would guess around 200/sec, which doesn't seem that
>> high.
>> 
>> 
>> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
>> wrote:
>> 
>>> Questions: How often do you commit your updates? What is your
>>> indexing rate in docs/second?
>>> 
>>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
>>> server is having trouble keeping up with updates, switching to CUSS
>>> probably wouldn't help.
>>> 
>>> So I suspect there's something not optimal about your setup that's
>>> the culprit.
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
>>> wrote:
>>>> We are testing our shiny new Solr Cloud architecture, but we are
>>>> experiencing some issues when doing bulk indexing.
>>>>
>>>> We have 5 Solr Cloud machines running and 3 indexing machines (separate
>>>> from the cloud servers). The indexing machines pull ids off a queue,
>>>> then index and ship the documents over via a CloudSolrServer. It appears
>>>> that the indexers are too fast, because the load (particularly disk IO)
>>>> on the Solr Cloud machines spikes through the roof, making the entire
>>>> cluster unusable. It's kind of odd because the total index size is not
>>>> even large, i.e. < 10GB. Are there any optimizations or enhancements I
>>>> could try to help alleviate these problems?
>>>>
>>>> I should note that for the above collection we only have 1 shard that's
>>>> replicated across all machines, so all machines have the full index.
>>>>
>>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where
>>>> all updates get sent to 1 machine and 1 machine only? We could then
>>>> remove this machine from the cluster that handles user requests.
>>>>
>>>> Thanks for any input.
>>> 
>> 
>> 



Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
updates get sent to one machine or something?


On Mon, Jan 20, 2014 at 2:42 PM, Software Dev wrote:

> We have a soft commit every 5 seconds and a hard commit every 30. As far
> as docs/second, I would guess around 200/sec, which doesn't seem that
> high.
>
>
> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
> wrote:
>
>> Questions: How often do you commit your updates? What is your
>> indexing rate in docs/second?
>>
>> In a SolrCloud setup, you should be using a CloudSolrServer. If the
>> server is having trouble keeping up with updates, switching to CUSS
>> probably wouldn't help.
>>
>> So I suspect there's something not optimal about your setup that's
>> the culprit.
>>
>> Best,
>> Erick
>>
>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
>> wrote:
>> > We are testing our shiny new Solr Cloud architecture, but we are
>> > experiencing some issues when doing bulk indexing.
>> >
>> > We have 5 Solr Cloud machines running and 3 indexing machines (separate
>> > from the cloud servers). The indexing machines pull ids off a queue,
>> > then index and ship the documents over via a CloudSolrServer. It appears
>> > that the indexers are too fast, because the load (particularly disk IO)
>> > on the Solr Cloud machines spikes through the roof, making the entire
>> > cluster unusable. It's kind of odd because the total index size is not
>> > even large, i.e. < 10GB. Are there any optimizations or enhancements I
>> > could try to help alleviate these problems?
>> >
>> > I should note that for the above collection we only have 1 shard that's
>> > replicated across all machines, so all machines have the full index.
>> >
>> > Would we benefit from switching to a ConcurrentUpdateSolrServer where
>> > all updates get sent to 1 machine and 1 machine only? We could then
>> > remove this machine from the cluster that handles user requests.
>> >
>> > Thanks for any input.
>>
>
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We have a soft commit every 5 seconds and a hard commit every 30. As far as
docs/second, I would guess around 200/sec, which doesn't seem that high.


On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson wrote:

> Questions: How often do you commit your updates? What is your
> indexing rate in docs/second?
>
> In a SolrCloud setup, you should be using a CloudSolrServer. If the
> server is having trouble keeping up with updates, switching to CUSS
> probably wouldn't help.
>
> So I suspect there's something not optimal about your setup that's
> the culprit.
>
> Best,
> Erick
>
> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
> wrote:
> > We are testing our shiny new Solr Cloud architecture, but we are
> > experiencing some issues when doing bulk indexing.
> >
> > We have 5 Solr Cloud machines running and 3 indexing machines (separate
> > from the cloud servers). The indexing machines pull ids off a queue,
> > then index and ship the documents over via a CloudSolrServer. It appears
> > that the indexers are too fast, because the load (particularly disk IO)
> > on the Solr Cloud machines spikes through the roof, making the entire
> > cluster unusable. It's kind of odd because the total index size is not
> > even large, i.e. < 10GB. Are there any optimizations or enhancements I
> > could try to help alleviate these problems?
> >
> > I should note that for the above collection we only have 1 shard that's
> > replicated across all machines, so all machines have the full index.
> >
> > Would we benefit from switching to a ConcurrentUpdateSolrServer where
> > all updates get sent to 1 machine and 1 machine only? We could then
> > remove this machine from the cluster that handles user requests.
> >
> > Thanks for any input.
>


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Erick Erickson
Questions: How often do you commit your updates? What is your
indexing rate in docs/second?

In a SolrCloud setup, you should be using a CloudSolrServer. If the
server is having trouble keeping up with updates, switching to CUSS
probably wouldn't help.

So I suspect there's something not optimal about your setup that's
the culprit.

Best,
Erick

On Mon, Jan 20, 2014 at 4:00 PM, Software Dev  wrote:
> We are testing our shiny new Solr Cloud architecture, but we are
> experiencing some issues when doing bulk indexing.
>
> We have 5 Solr Cloud machines running and 3 indexing machines (separate
> from the cloud servers). The indexing machines pull ids off a queue, then
> index and ship the documents over via a CloudSolrServer. It appears that
> the indexers are too fast, because the load (particularly disk IO) on the
> Solr Cloud machines spikes through the roof, making the entire cluster
> unusable. It's kind of odd because the total index size is not even large,
> i.e. < 10GB. Are there any optimizations or enhancements I could try to
> help alleviate these problems?
>
> I should note that for the above collection we only have 1 shard that's
> replicated across all machines, so all machines have the full index.
>
> Would we benefit from switching to a ConcurrentUpdateSolrServer where all
> updates get sent to 1 machine and 1 machine only? We could then remove
> this machine from the cluster that handles user requests.
>
> Thanks for any input.


Re: Facet count mismatch.

2014-01-20 Thread Ahmet Arslan
Hi Luis,

Do you have deletions? What happens when you expunge Deletes?

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22
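
A minimal SolrJ sketch of sending that commit to one shard (the shard URL
is taken from the shards parameter quoted below; illustrative only):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class ExpungeDeletes {
    public static void main(String[] args) throws Exception {
        // Run this against each shard in turn.
        SolrServer shard = new HttpSolrServer("http://solr1.test:8081/comments/data");
        UpdateRequest req = new UpdateRequest();
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.setParam("expungeDeletes", "true"); // merge away deleted docs
        req.process(shard);
    }
}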

Ahmet


On Monday, January 20, 2014 10:08 PM, Luis Cappa Banda  
wrote:

Hello!

I've installed a classical two-shard Solr 4.5 topology without SolrCloud,
load-balancing with an HA proxy. I've got a *copyField* like this:

[copyField declaration stripped by the mail archiver]

Copied from this one:

[source field and fieldType definition stripped by the mail archiver]


When faceting on the *tagsValues* field I get a total count of 3:


facet_counts: {
  facet_queries: { },
  facet_fields: {
    tagsValues: [ "sucks", 3 ]
  },
  facet_dates: { },
  facet_ranges: { }
}



But when searching on *tagsValues* like this, the total number of documents
is not three, but two:



params: {
  facet: "true",
  shards: "solr1.test:8081/comments/data,solr2.test:8080/comments/data",
  facet.mincount: "1",
  facet.sort: "count",
  q: "tagsValues:\"sucks\"",
  facet.limit: "-1",
  facet.field: "tagsValues",
  wt: "json"
}



Any idea of what's happening here? I'm confused, :-/

Regards,


-- 
- Luis Cappa


Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We are testing our shiny new Solr Cloud architecture, but we are
experiencing some issues when doing bulk indexing.

We have 5 Solr Cloud machines running and 3 indexing machines (separate
from the cloud servers). The indexing machines pull ids off a queue, then
index and ship the documents over via a CloudSolrServer. It appears that
the indexers are too fast, because the load (particularly disk IO) on the
Solr Cloud machines spikes through the roof, making the entire cluster
unusable. It's kind of odd because the total index size is not even large,
i.e. < 10GB. Are there any optimizations or enhancements I could try to
help alleviate these problems?

I should note that for the above collection we only have 1 shard that's
replicated across all machines, so all machines have the full index.

Would we benefit from switching to a ConcurrentUpdateSolrServer where all
updates get sent to 1 machine and 1 machine only? We could then remove this
machine from the cluster that handles user requests.

Thanks for any input.
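
For concreteness, here is a minimal SolrJ 4.x sketch of the indexing path
described above, batching documents per request instead of sending one
document at a time (the ZooKeeper address, collection name, and field names
are placeholders, not taken from the original setup):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "document " + i);
            batch.add(doc);
            if (batch.size() == 500) { // batch adds instead of one doc per request
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // No explicit commit: let the server-side autoCommit/autoSoftCommit
        // settings (discussed in this thread) control visibility.
        server.shutdown();
    }
}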


Facet count mismatch.

2014-01-20 Thread Luis Cappa Banda
Hello!

I've installed a classical two-shard Solr 4.5 topology without SolrCloud,
load-balancing with an HA proxy. I've got a *copyField* like this:

[copyField declaration stripped by the mail archiver]

Copied from this one:

[source field and fieldType definition stripped by the mail archiver]


When faceting on the *tagsValues* field I get a total count of 3:


facet_counts: {
  facet_queries: { },
  facet_fields: {
    tagsValues: [ "sucks", 3 ]
  },
  facet_dates: { },
  facet_ranges: { }
}



But when searching on *tagsValues* like this, the total number of documents
is not three, but two:



params: {
  facet: "true",
  shards: "solr1.test:8081/comments/data,solr2.test:8080/comments/data",
  facet.mincount: "1",
  facet.sort: "count",
  q: "tagsValues:\"sucks\"",
  facet.limit: "-1",
  facet.field: "tagsValues",
  wt: "json"
}



Any idea of what's happening here? I'm confused, :-/

Regards,


-- 
- Luis Cappa


Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Shawn Heisey

On 1/20/2014 3:02 AM, onetwothree wrote:

OS Windows server 2008

4 Cpu
8 GB Ram





We're using a .Net Service (based on Solr.Net) for updating and inserting
documents on a single Solr Core instance. The size of the documents sent to
Solr varies from 1 KB up to 8 MB; we're sending the documents in batches,
using one or multiple threads. The current size of the Solr index is about
15GB.

The indexing service runs around 4 to 5 hours per day to complete all
inserts and updates to Solr. While the indexing process is running, the
Tomcat process memory usage keeps growing up to > 7GB RAM (using the Process
Explorer monitoring tool) and does not go down, even after 24 hours. After a
restart of Tomcat, or a Reload Core in the Solr Admin, the memory drops back
to 1 to 2 GB RAM. While using a tool like VisualVM to monitor the Tomcat
process, the memory usage of Tomcat seems OK; memory consumption is in the
range of the defined JVM startup params (see image).

So it seems that the filesystem buffers are consuming all the leftover
memory and don't release it, even after quite some time? Is there a way to
handle this behaviour, so that not all memory is consumed? Are there other
alternatives? Best practices?




That picture seems to be a very low-res copy of your screenshot.  I 
can't really make it out.  I can tell you that it's completely normal 
for the OS disk cache (the filesystem buffers you mention) to take up 
all leftover memory.  If an application requests some of that memory, 
the OS will instantly give it up.


First, I'm going to explain something about memory reporting and Solr 
that I've noticed, then I will give you some news you probably won't like.


The numbers reported by visualvm are a true picture of Java heap memory 
usage.  The actual memory usage for Solr will be just a little bit more 
than those numbers.  In the newest versions of Solr, there seems to be a 
side effect of the Java MMAP implementation that results in incorrect 
memory usage reporting at the operating system level.  Here's a "top" 
output on one of my Solr servers running CentOS, sorted by memory 
usage.  The process at the top of the list is Solr.


https://www.dropbox.com/s/y1nus7lpzlb1mp9/solr-memory-usage-2014-01-20%2010.28.28.png

Some quick numbers for you: the machine has 64GB of RAM. Solr shows a
virtual memory size of 59.2GB. My indexes take up 51293336 KB (about 49GB)
of disk space, and Solr has a 6GB heap, so 59.2GB is not out of line for
the virtual memory size.


Now for where things get weird: There is 48GB of RAM taken up by the 
"cached" value, which is the OS disk cache.  The screenshot also shows 
that Solr is using 22GB of resident RAM.  If you add the 48GB in the OS 
disk cache and the 22GB of resident RAM for Solr, you get 70GB ... which 
is more memory than the machine even HAS, so we know something's off.  
The 'shared' memory for Solr is 15GB, which when you subtract it from 
the 22GB, gives you 7GB, which is much more realistic with a 6GB heap, 
and also makes it fit within the total system RAM.
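
(If you want to sanity-check the JVM-side numbers without VisualVM, a
trivial standalone snippet, nothing Solr-specific, prints what the JVM
itself sees; that is the number to trust over the OS process view:)

public class HeapReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        // Heap as the JVM sees it; OS tools additionally count mmapped index
        // files against the process, which is what inflates those numbers.
        System.out.printf("heap used: %d MB, heap committed: %d MB, heap max: %d MB%n",
            used / (1024 * 1024),
            rt.totalMemory() / (1024 * 1024),
            rt.maxMemory() / (1024 * 1024));
    }
}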


The news that you probably won't like:

I'm assuming that the whole reason you looked into memory usage was 
because you're having performance problems.  With 8GB of RAM and 3GB 
given to Solr, you basically have a little bit less than 5GB of RAM for 
the OS disk cache.  With that much RAM, most people can effectively 
cache an index up to about 10GB before performance problems show up.  
Your index is 15GB.  You need more total system RAM.  If Solr isn't 
crashing, you can probably leave the heap at 3GB with no problem.


http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Changing existing index to use block-join

2014-01-20 Thread Mikhail Khludnev
On Mon, Jan 20, 2014 at 6:11 PM,  wrote:

>
> Zitat von Mikhail Khludnev :
>
>  On Sat, Jan 18, 2014 at 11:25 PM,  wrote:
>>
>>  So, my question now: can I change my existing index in just adding a
>>> is_parent and a _root_ field and saving the journal id there like I did
>>> with j-id or do I have to reindex all my documents?
>>>
>>>
>> Absolutely, to use block-join you need to index nested documents as
>> blocks,
>> as it's described at
>> http://blog.griddynamics.com/2013/09/solr-block-join-support.html eg
>> https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml
>>
>>
> Thank you for the clarification.
> But there is no way to add new children without indexing the parent
> document and all existing childs again?
>
Yes, there is no way to add children incrementally. You need to nuke the
whole block and re-add it with all the necessary children.
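
For illustration, a minimal SolrJ (4.5+) sketch of re-adding a whole block;
the URL, ids, and field names are made up for the example:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReaddBlock {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "tshirt-1");
        parent.addField("is_parent", true);

        SolrInputDocument existingSku = new SolrInputDocument();
        existingSku.addField("id", "tshirt-1-red-m");
        existingSku.addField("color", "red");

        SolrInputDocument newSku = new SolrInputDocument();
        newSku.addField("id", "tshirt-1-blue-l");
        newSku.addField("color", "blue");

        // The whole block goes back in together: every existing child has to
        // be re-sent alongside the new one, in a single add.
        parent.addChildDocument(existingSku);
        parent.addChildDocument(newSku);
        server.add(parent);
        server.commit();
    }
}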


>
> So, in the example on GitHub, if I want to add new sizes and colors to an
> existing T-Shirt, I have to reindex the already existing T-Shirt and all
> its variations again?
>
Yes, completely reindex the T-shirt with all its SKUs.


>
> I understand that the blocks are created at index time, so I can't change
> an existing index to build blocks just by adding the _root_ field, but I
> don't get why it's not possible to add new children. Or did I misinterpret
> your statement?
>

Block join relies on internal Lucene doc numbers, which are defined by the
order in which documents have been indexed.

this might help
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene


>
> Thanks,
> -Gesh
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Getting all search words relevant for the document to be found

2014-01-20 Thread Tomaz Kveder
Hi!

I need a little help from you. 

We have complex documents stored in a database, and we show them on the page
from the database. We index them in Solr but do not store them, so we can't
use the Solr highlighter. Still, we would like to highlight the search words
found in the document. What approach would you suggest?

Our approach and idea are hidden in this basic question:
is it possible to get the list of all search words with which a specific
document was found (with all the language varieties of each word)?

Let me explain what I mean with a simplified example. We index the sentence:
"The big cloud is very dark". The user puts these words in the search box:
"clouds" "dark" "rain".

Can I get from Solr that that particular document was found because of the
words "cloud" and "dark"? Then we could highlight them in the content.

Of course we can highlight the exact words the user put into the search
field. But that's not enough. We would also like to highlight all the
language varieties that the document was found on.

Thanks!

Best regards,

Tomaz




[OT] Use Cases for Taming Text, 2nd ed.

2014-01-20 Thread Grant Ingersoll
Hi Solr Users,

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of 
Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting 
interested parties who would be willing to contribute to a chapter on practical 
use cases (i.e. you have something in production and are willing to write about 
it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine 
learning using Mahout, OpenNLP or MALLET -- ideally you are using combinations 
of 2 or more of these to solve your problems.  We are especially interested in 
large scale use cases in eCommerce, Advertising, social media analytics, fraud, 
etc.

The writing process is fairly straightforward.  A section roughly equates to 
somewhere between 3 - 10 pages, including diagrams/pictures.  After writing, 
there will be some feedback from editors and us, but otherwise the process is 
fairly simple.

In order to participate, you must have permission from your company to write on 
the topic.  You would not need to divulge any proprietary information, but we 
would want enough information for our readers to gain a high-level 
understanding of your use case.  In exchange for your participation, you will 
have your name and company published on that section of the book as well as in 
the acknowledgments section.  If you have a copy of Lucene in Action or Mahout 
In Action, it would be similar to the use case sections in those books.

If you are interested, please respond privately to me using my 
gsing...@apache.org email address with this subject line.

Thanks,
Grant, Drew, Tom







Re: Multi Lingual Analyzer

2014-01-20 Thread Erick Erickson
It Depends (tm). Approach (2) will give you better, more specific
search results. (1) is simpler to implement and might be "good
enough"...



On Mon, Jan 20, 2014 at 5:21 AM, David Philip
 wrote:
> Hi,
>
>
>
> I have a query on multi-lingual analyzers.
>
> Which of the two approaches below is the best?
>
> 1. To develop a translator that translates any language to English, and
> then use the standard English analyzer, applying the translator both at
> index time and at search time?
>
> 2. To develop a language-specific analyzer and use it on a field created
> specifically for that language?
>
> We have client data coming in different languages: Kannada and Telugu,
> with others to come later. This data is basically text written by
> customers in those languages.
>
> The requirement is to develop analyzers specific to these languages.
>
> Thanks - David


Re: Changing existing index to use block-join

2014-01-20 Thread dev


Zitat von Mikhail Khludnev :


On Sat, Jan 18, 2014 at 11:25 PM,  wrote:


So, my question now: can I change my existing index in just adding a
is_parent and a _root_ field and saving the journal id there like I did
with j-id or do I have to reindex all my documents?



Absolutely, to use block-join you need to index nested documents as blocks,
as it's described at
http://blog.griddynamics.com/2013/09/solr-block-join-support.html eg
https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml



Thank you for the clarification.
But there is no way to add new children without indexing the parent  
document and all existing childs again?


So, in the example on github, if I want to add new sizes and colors to  
an existing T-Shirt, I have to reindex the already existing T-Shirt  
and all it's variations again?


I understand that the blocks are created at index time, so I can't  
change an existing index to build blocks just in adding the _root_  
field, but I don't get why it's not possible to add new children or  
did I missinterpret your statement?


Thanks,
-Gesh



RE: Indexing URLs from websites

2014-01-20 Thread Markus Jelsma
Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 
 
-Original message-
> From:Teague James 
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> Progress!
> 
> I changed the value of that property in nutch-default.xml and I am getting 
> the anchor field now. However, the stuff going in there is a bit random and 
> doesn't seem to correlate to the pages I'm crawling. The primary objective is 
> that when there is something on the page that is a link to a file 
> ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual 
> code in the email) I want to capture that URL and the anchor text "Get the 
> PDF!" into field(s).
> 
> Am I going in the right direction on this?
> 
> Thank you so much for sticking with me on this - I really appreciate your 
> help!
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
> 
> 
> 
>  
>  
> -Original message-
> > From:Teague James 
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Okay. I had used that previously and I just tried it again. The following 
> > generated no errors:
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > Solr is still not getting an anchor field and the outlinks are not 
> > appearing in the index anywhere else.
> > 
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> > 
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > 
> > Then
> > 
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
> > crawl/linkdb -dir crawl/segments/
> > 
> > No errors, but no anchor fields or outlinks. One thing in the response from 
> > the crawl that I found interesting was a line that said:
> > 
> > LinkDb: internal links will be ignored.
> 
> Good catch! That is likely the problem. 
> 
> > 
> > What does that mean?
> 
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.</description>
> </property>
> 
> 
> So change the property, rebuild the linkdb and try reindexing once again :)
> 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> > 
> > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
> > k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> > [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
> > [-filter] [-normalize]
> > 
> > You must point to the linkdb via the -linkdb parameter. 
> >  
> > -Original message-
> > > From:Teague James 
> > > Sent: Thursday 16th January 2014 16:57
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > > Okay. I changed my solrindex to this:
> > > 
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
> > > crawl/linkdb
> > > crawl/segments/20140115143147
> > > 
> > > I got the same errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
> > > does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
> > > path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
> > > Java stacktrace
> > > 
> > > Those linkdb folders are not being created.
> > > 
> > > -Original Message-
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, January 16, 2014 10:44 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > > 
> > > Hi - you cannot use wildcards for segments. You need to give one segment 
> > > or a -dir segments_dir. Check the usage of your indexer command. 
> > >  
> > > -Original message-
> > > > From:Teague James 
> > > > Sent: Thursday 16th January 2014 16:43
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: Indexing URLs from websites
> > > > 
> > > > Hello Markus,
> > > > 
> > > > I do get a linkdb folder in the crawl folder that gets created - but it 
> > > > is created at the time that I execute the command automatically by 
> > > > Nutch. I just tried to use solrindex against yesterday's cawl and did 
> > > > not get any errors, but did not get the anchor field or any of the 
> > > > outlinks. I used this command:
> > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkd

Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Toke Eskildsen
On Mon, 2014-01-20 at 11:02 +0100, onetwothree wrote:
> Optional JVM parameters set xmx = 3072, xms = 1024
> directoryFactory: MMapDirectory

[...]

> So it seems that filesystem buffers are consuming all the leftover memory??,
> and don't release memory, even after a quite amount of time?

As long as the memory is indeed leftover, that is the optimal strategy.
Maybe Uwe's explanation of MMapDirectory will help:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Regards,
Toke Eskildsen, State and University Library, Denmark




Re: Error when creating collection in Solr 4.6

2014-01-20 Thread Uwe Reh

Hi,

I had the same problem.
In my case the error was a copy/paste typo in my solr.xml:

"${genericCoreNodeNames:true}"
!^! Ouch!

With the type 'bool' instead of 'str' it works definitely better. ;-)

Uwe



Am 28.11.2013 08:53, schrieb lansing:

Thank you for your replies,
I am using the new-style discovery
It worked after adding this setting :
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>










Re: Query by range of price

2014-01-20 Thread rachun
Thank you very much Mr. Raymond

You just saved my world ;)
It worked and *sorts by the conditions*,
but facet.query=price_min:[* TO 1300] is not working yet; I will try to
Google for the right solution.

Million thanks _/|\_
Rachun.





Re: Search Suggestion Filtering

2014-01-20 Thread Alessandro Benedetti
Hi guys, following this thread I have some questions:

1) Regarding LUCENE-5350, what is the "context" that is quoted there? Is the
context a filter query?

2) Regarding https://issues.apache.org/jira/browse/SOLR-5378, is the final
documentation available?

Cheers


2014/1/16 Hamish Campbell 

> Thank you Jorge. We looked at phrase suggestions from previous user
> queries, but they're not so useful in our case. However, I have a follow-up
> question about similar functionality that I'll post shortly.
>
> The list might like to know that I've come up with a quick and exceedingly
> dirty hack solution that works for our limited case.
>
> You have been warned!
>
> Note that we're using django-haystack to actually interact with Solr:
>
> 1. Set nonFuzzyPrefix of the Suggester to 4.
> 2. At index time, the haystack index builds suggestion terms by
> extracting the relevant terms and prefixing them with a 4-character
> (alpha) reference for the target instance.
> 3. At search time, the user's query is split, the terms are prefixed and
> concatenated. The new query is sent to Solr, and the results are cleaned
> of the references before being returned to the front end.
>
> I'm not proud of it, but it works. =D
>
>
>
> On Fri, Jan 17, 2014 at 3:13 AM, Jorge Luis Betancourt González <
> jlbetanco...@uci.cu> wrote:
>
> > In a custom application we have, we use a separate core (under Solr
> > 3.6.1) to store the queries used by the users and then provide the
> > autocomplete feature. In our case we need to filter out some phrases
> > that we don't want suggested to the users. I built a custom
> > UpdateRequestProcessor to implement this logic, so we can define these
> > "blocking patterns" in some external source of information (DB, files,
> > etc.). For the suggestions per se we use as a base the
> > https://github.com/cominvent/autocomplete configuration, described in
> > www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/,
> > which is pretty usable as it comes. I found (personally) this approach
> > way more flexible than the original suggester component, but it involves
> > storing the users' queries in a separate core.
> >
> > Greetings,
> >
> > - Original Message -
> > From: "Hamish Campbell" 
> > To: solr-user@lucene.apache.org
> > Sent: Wednesday, January 15, 2014 9:10:16 PM
> > Subject: Re: Search Suggestion Filtering
> >
> > Thanks Tomás, I'll take a look.
> >
> > Still interested to hear from anyone about using queries to populate the
> > list - I'm willing to give up a bit of performance for the flexibility it
> > would provide.
> >
> >
> > On Thu, Jan 16, 2014 at 1:06 PM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> > > I think your use case is the one described in LUCENE-5350, maybe you
> want
> > > to take a look to the patch and comments there.
> > >
> > > Tomás
> > >
> > >
> > > On Wed, Jan 15, 2014 at 12:58 PM, Hamish Campbell <
> > > hamish.campb...@koordinates.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm looking into options for filtering the search suggestions
> > dictionary.
> > > >
> > > > Using Solr 4.6.0, Suggester component and fst.FuzzyLookupFactory
> using
> > a
> > > > field based dictionary, we're indexing records for a multi-tenanted
> > SaaS
> > > > platform. SearchHandler records are always filtered by the particular
> > > > client warehouse (e.g. by domain), however we need a way to apply a
> > > similar
> > > > filter to the spell check dictionary to prevent leaking terms between
> > > > clients. In other words: when client A searches for a document title
> > they
> > > > should not receive spelling suggestions for client B's document
> titles.
> > > >
> > > > This has been asked a couple of times, on the mailing list and on
> > > > StackOverflow. Some of the suggested approaches:
> > > >
> > > > 1. Use dynamic fields to create dictionaries per-warehouse (mentioned
> > > here:
> > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/Filtering-down-terms-in-suggest-tt4069627.html
> > > > )
> > > >
> > > > That might be a reasonable option for us (we already considered a
> > similar
> > > > approach), but at what point does this stop scaling efficiently? How
> > many
> > > > dynamic fields are too many?
> > > >
> > > > 2. Run a query to populate the suggestion list (also mentioned in
> that
> > > > thread)
> > > >
> > > > If I understand this correctly, this would give us a lot of
> flexibility
> > > and
> > > > power: for example to give a more nuanced result set using the users
> > > > permissions to expose private documents in their spelling
> suggestions.
> > > >
> > > > I expect this would be a slow query, but our total document count is
> > > > currently relatively small (on the order of 10^3 objects) and I
> imagine
> > > you
> > > > could create a specific word index with the appropriate fields to
> keep
> > > this
> > > > in check. Is this a feasible approach, and if so, how do you build a
> > > > dynamic suggestion list?
> > > >

Multi Lingual Analyzer

2014-01-20 Thread David Philip
Hi,



I have a query on multi-lingual analyzers.

Which of the two approaches below is the best?

1. To develop a translator that translates any language to English, and then
use the standard English analyzer, applying the translator both at index
time and at search time?

2. To develop a language-specific analyzer and use it on a field created
specifically for that language?

We have client data coming in different languages: Kannada and Telugu, with
others to come later. This data is basically text written by customers in
those languages.

The requirement is to develop analyzers specific to these languages.

Thanks - David
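
To make approach (2) concrete, here is a minimal Lucene 4.x sketch of a
language-specific analyzer that would back a dedicated field such as
"text_kn" (the filter chain is illustrative only, not a complete Kannada
analyzer; IndicNormalizationFilter ships with lucene-analyzers-common):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Used at both index and query time on the dedicated per-language field.
public class KannadaAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_46, source);
        filter = new IndicNormalizationFilter(filter); // normalizes Indic scripts
        return new TokenStreamComponents(source, filter);
    }
}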


Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Yago Riveiro
Another thing: Solr makes heavy use of the OS cache to cache the index and
gain performance. This can be another reason why the Solr process shows a
high allocated-memory value.


/yago
—
/Yago Riveiro

On Mon, Jan 20, 2014 at 10:03 AM, onetwothree 
wrote:

> Facts:
> OS Windows server 2008
> 4 Cpu
> 8 GB Ram
> Tomcat Service version 7.0 (64 bit)
> Only running Solr
> Optional JVM parameters set xmx = 3072, xms = 1024
> Solr version 4.5.0.
> One Core instance (both for querying and indexing)
> *Schema config:*
> minGramSize="2" maxGramSize="20"
> most of the fields are stored = "true" (required)
> *Solr config:*
> ramBufferSizeMB: 100
> maxIndexingThreads: 8
> directoryFactory: MMapDirectory
> autocommit: maxdocs 1, maxtime 15000, opensearcher false
> cache (defaults): 
> filtercache initialsize:512 size: 512 autowarm: 0
> queryresultcache initialsize:512 size: 512 autowarm: 0
> documentcache initialsize:512 size: 512 autowarm: 0
> Problem description:
> We're using a .Net Service (based on Solr.Net) for updating and inserting
> documents on a single Solr Core instance. The size of the documents sent to
> Solr varies from 1 KB up to 8 MB; we're sending the documents in batches,
> using one or multiple threads. The current size of the Solr index is about
> 15GB.
> The indexing service runs around 4 to 5 hours per day to complete all
> inserts and updates to Solr. While the indexing process is running, the
> Tomcat process memory usage keeps growing up to > 7GB RAM (using the Process
> Explorer monitoring tool) and does not go down, even after 24 hours. After a
> restart of Tomcat, or a Reload Core in the Solr Admin, the memory drops back
> to 1 to 2 GB RAM. While using a tool like VisualVM to monitor the Tomcat
> process, the memory usage of Tomcat seems OK; memory consumption is in the
> range of the defined JVM startup params (see image).
> So it seems that the filesystem buffers are consuming all the leftover
> memory and don't release it, even after quite some time? Is there a way to
> handle this behaviour, so that not all memory is consumed? Are there other
> alternatives? Best practices?
>  
> Thanks in advance

Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Yago Riveiro
The high memory consumption you see may be a consequence of heap memory only
being released after a full GC. With the VisualVM tool you can force a full
GC and see whether the memory is released.


/yago
—
/Yago Riveiro

On Mon, Jan 20, 2014 at 10:03 AM, onetwothree 
wrote:

> Facts:
> OS Windows server 2008
> 4 Cpu
> 8 GB Ram
> Tomcat Service version 7.0 (64 bit)
> Only running Solr
> Optional JVM parameters set xmx = 3072, xms = 1024
> Solr version 4.5.0.
> One Core instance (both for querying and indexing)
> *Schema config:*
> minGramSize="2" maxGramSize="20"
> most of the fields are stored = "true" (required)
> *Solr config:*
> ramBufferSizeMB: 100
> maxIndexingThreads: 8
> directoryFactory: MMapDirectory
> autocommit: maxdocs 1, maxtime 15000, opensearcher false
> cache (defaults): 
> filtercache initialsize:512 size: 512 autowarm: 0
> queryresultcache initialsize:512 size: 512 autowarm: 0
> documentcache initialsize:512 size: 512 autowarm: 0
> Problem description:
> We're using a .Net Service (based on Solr.Net) for updating and inserting
> documents on a single Solr Core instance. The size of the documents sent to
> Solr varies from 1 KB up to 8 MB; we're sending the documents in batches,
> using one or multiple threads. The current size of the Solr index is about
> 15GB.
> The indexing service runs around 4 to 5 hours per day to complete all
> inserts and updates to Solr. While the indexing process is running, the
> Tomcat process memory usage keeps growing up to > 7GB RAM (using the Process
> Explorer monitoring tool) and does not go down, even after 24 hours. After a
> restart of Tomcat, or a Reload Core in the Solr Admin, the memory drops back
> to 1 to 2 GB RAM. While using a tool like VisualVM to monitor the Tomcat
> process, the memory usage of Tomcat seems OK; memory consumption is in the
> range of the defined JVM startup params (see image).
> So it seems that the filesystem buffers are consuming all the leftover
> memory and don't release it, even after quite some time? Is there a way to
> handle this behaviour, so that not all memory is consumed? Are there other
> alternatives? Best practices?
>  
> Thanks in advance

LSH in Solr/Lucene

2014-01-20 Thread Shashi Kant
Hi folks, have any of you successfully implemented LSH (MinHash) in
Solr? If so, could you share some details of how you went about it?

I know LSH is available in Mahout, but I was hoping someone has a Solr or
Lucene implementation.

Thanks
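
MinHash itself is compact enough to sketch. Below is a minimal,
self-contained Java version; note that it does no Solr/Lucene integration,
which is exactly the part being asked about:

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

// Minimal MinHash: documents with similar term sets get signatures that
// agree in a fraction of positions approximating their Jaccard similarity.
public class MinHasher {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a, b;

    public MinHasher(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // One signature entry per hash function: the minimum permuted term hash.
    public long[] signature(Set<Integer> termHashes) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int t : termHashes) {
            long x = (t & 0xffffffffL) % PRIME;
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // The fraction of agreeing positions estimates Jaccard(setA, setB).
    public static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
        return (double) same / s1.length;
    }
}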


Re: Query by range of price

2014-01-20 Thread Raymond Wiker
Followup: I *think* something like this should work:

$results = $solr->search($query, $start, $rows,
    array('sort' => 'price_min asc,update_date desc',
          'facet.query' => 'price_min:[* TO 1300]'));


On Mon, Jan 20, 2014 at 11:05 AM, Raymond Wiker  wrote:

> That's exactly what I would expect from url-encoding '&'. So, the thing
> that you're doing works as it should, but you're probably doing something
> that you should not do (in this case, urlencode).
>
> I have not used SolrPHPClient myself, but from the example at
> http://code.google.com/p/solr-php-client/wiki/FAQ#How_Can_I_Use_Additional_Parameters_%28like_fq,_facet,_etc%29
> it appears that you should not do any urlencoding yourself, at all.
> Further, if you're using data that is already urlencoded, you should
> urldecode it before handing it over to SolrPHPClient.
>
>
> On Mon, Jan 20, 2014 at 10:34 AM, rachun  wrote:
>
>> Hi Raymond,
>>
>> I keep trying to encode the '&', but when I look at the Solr log it shows
>> me '%26'. I'm using urlencode and it didn't work; what should I do? I'm
>> using SolrPHPClient. Please suggest.
>>
>> Thank you very much,
>> Rachun
>>
>>
>>
>>
>
>


Re: Query by range of price

2014-01-20 Thread Raymond Wiker
That's exactly what I would expect from url-encoding '&'. So, the thing
that you're doing works as it should, but you're probably doing something
that you should not do (in this case, urlencode).

I have not used SolrPHPClient myself, but from the example at
http://code.google.com/p/solr-php-client/wiki/FAQ#How_Can_I_Use_Additional_Parameters_%28like_fq,_facet,_etc%29
it appears that you should not do any urlencoding yourself, at all.
Further, if you're using data that is already urlencoded, you should
urldecode it before handing it over to SolrPHPClient.


On Mon, Jan 20, 2014 at 10:34 AM, rachun  wrote:

> Hi Raymond,
>
> I keep trying to encode the '&', but when I look at the Solr log it shows
> me '%26'. I'm using urlencode and it didn't work; what should I do? I'm
> using SolrPHPClient. Please suggest.
>
> Thank you very much,
> Rachun
>
>
>
>


Memory Usage on Windows Os while indexing

2014-01-20 Thread onetwothree
Facts:


OS Windows server 2008

4 Cpu
8 GB Ram

Tomcat Service version 7.0 (64 bit)

Only running Solr
Optional JVM parameters set xmx = 3072, xms = 1024
Solr version 4.5.0.

One Core instance (both for querying and indexing)
*Schema config:*
minGramSize="2" maxGramSize="20"
most of the fields are stored = "true" (required)

*Solr config:*
ramBufferSizeMB: 100
maxIndexingThreads: 8
directoryFactory: MMapDirectory
autocommit: maxdocs 1, maxtime 15000, opensearcher false
cache (defaults): 
filtercache initialsize:512 size: 512 autowarm: 0
queryresultcache initialsize:512 size: 512 autowarm: 0
documentcache initialsize:512 size: 512 autowarm: 0

Problem description:


We're using a .Net Service (based on Solr.Net) for updating and inserting
documents on a single Solr Core instance. The size of the documents sent to
Solr varies from 1 KB up to 8 MB; we're sending the documents in batches,
using one or multiple threads. The current size of the Solr index is about
15GB.

The indexing service runs around 4 to 5 hours per day to complete all
inserts and updates to Solr. While the indexing process is running, the
Tomcat process memory usage keeps growing up to > 7GB RAM (using the Process
Explorer monitoring tool) and does not go down, even after 24 hours. After a
restart of Tomcat, or a Reload Core in the Solr Admin, the memory drops back
to 1 to 2 GB RAM. While using a tool like VisualVM to monitor the Tomcat
process, the memory usage of Tomcat seems OK; memory consumption is in the
range of the defined JVM startup params (see image).

So it seems that the filesystem buffers are consuming all the leftover
memory and don't release it, even after quite some time? Is there a way to
handle this behaviour, so that not all memory is consumed? Are there other
alternatives? Best practices?

 

Thanks in advance






Re: Query by range of price

2014-01-20 Thread rachun
Hi Raymond, 

I keep trying to encode the '&', but when I look at the Solr log it shows me
'%26'. I'm using urlencode and it didn't work; what should I do? I'm using
SolrPHPClient. Please suggest.

Thank you very much, 
Rachun 


