Thanks a lot, Paritosh. I really appreciate your help.
On Fri, Nov 4, 2011 at 12:15 PM, Paritosh Ranjan <[email protected]> wrote:
> pseudo code:
>
> while (has next record in clustered output) {
>     readNextRecord();
>     extractVectorAndClusterIdFromRecord();
>     if (directory named clusterId does not exist) {
>         create directory named clusterId
>     }
>     writeVectorInDirectoryNamedClusterId();
> }
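>
> In Java terms, a rough sketch of that loop could look like the following
> (untested; it assumes the clustered output is a SequenceFile of IntWritable
> cluster ids and WeightedVectorWritable vectors, which is what ClusterDumper
> reads, and the class and file names are only placeholders):
>
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.mahout.clustering.WeightedVectorWritable;
>
> public final class ClusterSplitter {
>
>   // Streams one clustered-output part file, appending each vector to a
>   // per-cluster SequenceFile instead of collecting everything in memory.
>   // For a directory of part files, call this once per part file.
>   public static void split(Configuration conf, Path partFile, Path outDir)
>       throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, partFile, conf);
>     Map<Integer, SequenceFile.Writer> writers =
>         new HashMap<Integer, SequenceFile.Writer>();
>     try {
>       IntWritable clusterId = new IntWritable();
>       WeightedVectorWritable vector = new WeightedVectorWritable();
>       while (reader.next(clusterId, vector)) {            // readNextRecord()
>         SequenceFile.Writer writer = writers.get(clusterId.get());
>         if (writer == null) {
>           // One directory (and one file) per cluster, created on first use.
>           Path clusterFile = new Path(
>               new Path(outDir, String.valueOf(clusterId.get())), "part-0");
>           writer = SequenceFile.createWriter(fs, conf, clusterFile,
>               IntWritable.class, WeightedVectorWritable.class);
>           writers.put(clusterId.get(), writer);
>         }
>         writer.append(clusterId, vector);                 // incremental write
>       }
>     } finally {
>       reader.close();
>       for (SequenceFile.Writer w : writers.values()) {
>         w.close();
>       }
>     }
>   }
> }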
>
> On 04-11-2011 12:09, gaurav redkar wrote:
>
>> Thanks a lot for your help. Yes, I will be running it on a Hadoop
>> cluster. Can you elaborate a bit on writing to the file system
>> incrementally?
>>
>> On Fri, Nov 4, 2011 at 11:51 AM, Paritosh Ranjan<[email protected]>
>> wrote:
>>
>>> Data that big would need to run on a Hadoop cluster.
>>>
>>> Right now, I think there is no utility which can collect the data in
>>> the form you want. You will have to read it line by line and group the
>>> vectors belonging to the same cluster. It would be good if you could
>>> write them to the file system incrementally, as this would get rid of
>>> the memory problem.
>>>
>>> Or, try CanopyDriver with clusterFilter > 0, which might help in
>>> reducing the number of clusters that you are getting as output, which,
>>> in turn, might reduce memory usage.
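>>>
>>> If it helps, a minimal sketch of invoking canopy with a cluster filter
>>> through its command-line options (the option names are as in the current
>>> CLI, and the paths and thresholds below are placeholders; please verify
>>> the flags with "bin/mahout canopy --help"):
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.util.ToolRunner;
>>> import org.apache.mahout.clustering.canopy.CanopyDriver;
>>>
>>> public class CanopyWithClusterFilter {
>>>   public static void main(String[] args) throws Exception {
>>>     // CanopyDriver is a Hadoop Tool, so it can be driven with ToolRunner
>>>     // using the same options as the bin/mahout canopy command line.
>>>     ToolRunner.run(new Configuration(), new CanopyDriver(), new String[] {
>>>         "-i", "/path/to/input/vectors",   // placeholder input path
>>>         "-o", "/path/to/canopy/output",   // placeholder output path
>>>         "-dm", "org.apache.mahout.common.distance.EuclideanDistanceMeasure",
>>>         "-t1", "3.0",                     // example threshold, tune for your data
>>>         "-t2", "1.5",                     // example threshold, tune for your data
>>>         "-cf", "10"                       // clusterFilter: drop canopies with < 10 points
>>>     });
>>>   }
>>> }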
>>>
>>>
>>> On 04-11-2011 11:43, gaurav redkar wrote:
>>>
>>>> Actually, I have to run the mean shift algorithm on a large dataset
>>>> for my project. The clusterdumper facility works on smaller datasets.
>>>>
>>>> But my project will mostly involve large-scale data (the size will
>>>> often extend to gigabytes), so I need to modify the clusterdumper
>>>> facility to work on such datasets. Also, the vectors are densely
>>>> populated.
>>>>
>>>> I probably need to read each file from pointsDir one at a time while
>>>> constructing the "result" map. Any pointers on how I would do that?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan<[email protected]>
>>>> wrote:
>>>>
>>>>> Reducing the dimension (drastically; try fewer than 100 if the
>>>>> functionality allows this) can be a solution.
>>>>>
>>>>> Which vector implementation are you using? If the vectors are sparsely
>>>>> populated (have lots of uninitialized/unused dimensions), you can use
>>>>> RandomAccessSparseVector or SequentialAccessSparseVector, which will
>>>>> populate only the dimensions you are actually using. This can also
>>>>> decrease memory consumption.
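>>>>>
>>>>> A tiny sketch of the difference (the cardinality of 1000 mirrors the
>>>>> dataset you describe below; the class name is just a placeholder):
>>>>>
>>>>> import org.apache.mahout.math.DenseVector;
>>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>>> import org.apache.mahout.math.Vector;
>>>>>
>>>>> public class SparseVectorDemo {
>>>>>   public static void main(String[] args) {
>>>>>     // A dense vector allocates storage for all 1000 dimensions up front.
>>>>>     Vector dense = new DenseVector(1000);
>>>>>
>>>>>     // A sparse vector of the same cardinality stores only the
>>>>>     // dimensions that are actually set.
>>>>>     Vector sparse = new RandomAccessSparseVector(1000);
>>>>>     sparse.set(3, 1.5);
>>>>>     sparse.set(742, 0.25);
>>>>>
>>>>>     System.out.println("dense cardinality: " + dense.size());
>>>>>     System.out.println("sparse entries stored: "
>>>>>         + sparse.getNumNondefaultElements());
>>>>>   }
>>>>> }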
>>>>>
>>>>>
>>>>> On 04-11-2011 11:19, gaurav redkar wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Yes, Paritosh, I think the same. Actually, I am using a test dataset
>>>>>> that has 5000 tuples with 1000 dimensions each. The thing is, there
>>>>>> are too many files created in the pointsDir folder, and I think the
>>>>>> program tries to open a path to all the files (i.e. read all the
>>>>>> files into memory at once). Is my interpretation correct? Also, how
>>>>>> do I go about fixing it?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan<[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> The point reading is keeping everything in memory, which might have
>>>>>>> crashed it:
>>>>>>>
>>>>>>> pointList.add(record.getSecond());
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Your dataset size is 40 MB, but the vectors might be too large. How
>>>>>>> many dimensions do your vectors have?
>>>>>>>
>>>>>>>
>>>>>>> On 04-11-2011 10:57, gaurav redkar wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am in a fix with the ClusterDumper utility. The clusterdump
>>>>>>>> utility crashes when it tries to output the clusters, throwing an
>>>>>>>> out of memory exception: Java heap space.
>>>>>>>>
>>>>>>>> When I checked the error stack, it seems that the program crashed
>>>>>>>> in the readPoints() function. I guess it is unable to build the
>>>>>>>> "result" map. Any idea how I can fix this?
>>>>>>>>
>>>>>>>> I am working on a dataset of size 40 MB. I had tried increasing the
>>>>>>>> heap space, but with no luck.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Gaurav
>>>>>>>>
>>>
>>
>>
>
>